GREG MITTLEIDER: Good evening, everybody. My name is Greg Mittleider, and on behalf of CS50 I would like to thank you for joining us, both in person and online, for the final in our series of the 2017 tech talks. Tonight we are very fortunate to have the founder and CEO of Sentry, Mr. David Cramer. Tonight he's going to talk to us about the importance of shipping continuously and getting product to market as soon as possible. Without further ado, David Cramer. DAVID CRAMER: Thanks. So, disclaimer, I prepared this presentation for a mixed background. So it doesn't go super high level, but it doesn't go super low level. Hopefully a lot of these principles apply elsewhere too, but they really apply to software engineering. As mentioned, I'm the founder and CEO of Sentry. I'm a software engineer by trade. Previously, I worked on infrastructure teams at Dropbox and Disqus. I'm sure you're all familiar with the companies. But they're fairly large technology companies, at least in terms of their footprint on the internet. And Dropbox, at the very least, is a very large company in general at this point. But suffice to say there's a lot of hard technology problems. And at both of these companies, I really honed in on things that revolve around infrastructure scale and developer productivity. So a lot of that is how do we make it easier to write software? How do we make it so we can ship software more quickly, and minimize mistakes? And so today we're going to talk a little bit about that. This is not really a talk about Sentry, though. This sort of aligns with what we do. It's a talk about what we call continuous iteration. And I'll get into definitions of that here in a little bit. But let's start with software development flow. So, if you've been using computers for any period of time, you have an operating system. And operating systems are probably the classic version of-- I don't want to say archaic, there were benefits-- but an old release cycle, an old workflow that isn't necessarily the right idea anymore. So, often what you have is something on the order of a-- these days it's probably a 6 to 12 month release cycle. And what that means is you have, so say it's a 12 month release cycle. You spend 12 months probably on development and then another N months on QA, and a bunch of manual human processes. If you look at Microsoft, they have dedicated teams that go in, and they actually test everything. And there's a lot of reasons they do this. One is for stability, but the reason they have to do this is because of that process. And if we look at where technology's gone, especially with the internet, things don't really work that way anymore. Even these big companies, like Microsoft and Apple, are trying to do more rapid, smaller iteration cycles. And we'll go into a lot of those reasons, and why that matters, and sort of how we address that today. But a quick high level: the biggest benefit we see in those old processes is the big stability increase. So if you look at a lot of web properties today-- I'm trying to think of a recent problem-- but I was just talking to David earlier about Twitter, back in the day when the fail whale was famous, which is when you would load Twitter and it was always down, because they didn't have enough capacity. That's kind of one of these issues that happens when you change rapidly. Another might be the introduction of bugs that are not well tested or something.
But when you have these long processes, you can spend a lot of time on the QA phase, and you actually need to. And you primarily need to, because if there is an issue, it's going to take you quite a while to come out with a patch for that. We see this often in software these days with security concerns. There was a pretty important Wi-Fi security vulnerability, it was a few weeks back at this point, and the day it was announced, vendors still did not have a patch ready. And that's because they have to go through a pretty long process to vet these patches to make sure they're not going to cause other problems. And this is kind of where we see the flaw of this old paradigm. So today the focus is really going to be on how do we fix that, and why and how do we want to ship continuously? I think the why is kind of self-explanatory. But let's start with the obvious case in software development. So this is how we described Sentry early on when we did our first venture capitalist pitches. You have a product-- Twitter is not a good example here, so we'll actually use Sentry. And you've launched a new version or you've released a change, and somebody on Twitter is going to start yelling at you. You may not pay attention to Twitter, but you probably should. And they're going to say it's broken, and that's all they're going to say. So what happens is, in a larger company-- and this was even true at Dropbox at the time-- you have separate departments. So, say you have a support department. And they're going to address that customer's concern first, but they don't really know anything about what the customer's talking about, unless it's a known issue. So the customer is going to be like, I can't load dropbox.com. And the support team is going to go through some steps, that's kind of standard, they're going to go back and forth with the customer collecting information. And then, if they deem it important, so say if there's enough customers, they're going to circulate that internally. So they might have to go to multiple teams. So, say the Dropbox website is owned by a web team. And then there's an API team that works with that. And does it affect the mobile team? But they're going to go in this loop of finding out who to talk to, what the problem is, and eventually addressing that problem. Somebody is probably going to ship that update to production to the customers, and hopefully it's fixed. But then there's sort of like this need to communicate that back to the customer as well. And this is sort of a more concrete illustration of the problem with that archaic lifecycle process. It's a very long, inefficient process. And really what you want to do is slim down this machine in a lot of ways. And, fortunately, we're here at a CS class, and a lot of this comes down to technology and automation, and how we cut out people as much as possible, or at least repurpose those people. So a few things to know about, these terms change every few years, but there's three things that we think about when we talk about these subjects. One is continuous integration. This idea has been around for a long time. I know a lot of people are familiar with GitHub. A lot of this is, you take a pull request, and you ask, is it actually going to be able to be shipped, like merged into your core codebase? But also part of that is, can you verify that it's going to be good? Can you have automated tests, or even human tests, it actually doesn't matter, but really you want automation that says this change is actually good enough to move to the next stage.
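To make that last idea concrete, here is a minimal sketch of the kind of automated check a CI system runs against every proposed change before it can move forward. The function and tests are hypothetical and use pytest, which the talk does not specify; the point is only that a machine, not a person, decides whether the change is good enough to merge.

```python
import pytest


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical application code touched by a pull request."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


def test_apply_discount():
    # CI runs this on every pull request; a red result blocks the merge.
    assert apply_discount(1000, 25) == 750
    assert apply_discount(1000, 0) == 1000


def test_apply_discount_rejects_bad_input():
    with pytest.raises(ValueError):
        apply_discount(1000, 150)
```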
And this has been a big trend for probably the past decade, but now it's sort of getting required. And this is the biggest thing that everybody needs to be able to move more quickly. And even the big companies, this is why they're able to do larger updates much more quickly. It's just because of this huge investment in technology around automating this process. Continuous deployment is sort of one of these newer concepts. The simplest version of this is anybody at Sentry can go into a tool we have internally and they can click deploy. And that will take whatever has passed the continuous integration phase, and it will deploy it to our customers in a matter of minutes. So it's not about always deploying everything, but it's about deploying anything when you deem it's ready and it's safe. So, it's giving ownership of that, and it's really automating that process and making it quick and painless. And then the last one-- observability, which really, at the end of the day, just means monitoring. This is kind of a newer industry trend. You build something that you're going to have some kind of user of, and you don't want to launch it to your users or your customers, and then have that Twitter scenario where it's broken, and be like, well, I don't know. I didn't actually instrument anything. And I didn't have any monitoring set up. So observability is the idea that you do that from day one. So before you ever ship anything to your customers or have any users, you build in this instrumentation or this monitoring. But otherwise it's just the same thing at the end of the day. When we're talking about doing this continuously, it's this feedback lifecycle that we want to shorten. We go back to that picture. And it's like, the customer has an issue, and it gets fixed. And the customer's aware of that fix. How do we make that as quick as possible? And really, at the end of the day, that's why we want this fast lifecycle. So there are some things to look for if we're looking to implement this in a system. The first and foremost is that it has to be very fast to deliver these changes, and it has to be incremental. The best example I can give you-- and I don't know how much you all do peer review-- but if you've ever tried reviewing a 5,000 line change set, it's really, really hard. And it's hard because it's a lot of information, and you just don't have context. And, honestly, you're never going to be able to safely review it. So, part of this is minimizing what the component is that's changing. And this is valuable in so many things. And the way we sort of approach this, a measure for that, is we say, "Well, we need to be able to release changes at least daily." And that's sort of a gold standard for any major company. I think even Facebook these days still does daily releases. But they do it sort of on a weekly lifecycle, where they are shipping things every day, but they rotate that out to customers seven days later. And Dropbox also had a daily lifecycle, with a similar paradigm. At Sentry, we probably ship changes to production 20, 30 times a day, but it's a lot easier. We're only like 60 people. Things scale a little bit better at that size. The other thing is, when we're doing this, we want to keep the changes small, which helps because it minimizes the impact. But we still want to maintain that high quality while moving fast.
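Circling back to the observability piece for a moment: as an illustration of "instrumentation from day one," here is a rough sketch of dropping an error-monitoring SDK into an app before it ever ships, using Sentry's Python SDK as one example. The DSN, release name, and application code are placeholders, and this is not something shown in the talk.

```python
import sentry_sdk

# Initialize once at startup, before there are any real users. The DSN is a
# placeholder; release/environment let reports be tied back to a deploy.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    release="myapp@1.0.0",
    environment="production",
)


def charge_customer(amount_cents: int) -> None:
    # Hypothetical application code.
    if amount_cents <= 0:
        raise ValueError("amount must be positive")


try:
    charge_customer(-500)
except Exception as exc:
    # Report a handled error explicitly; unhandled exceptions are picked up
    # automatically by the SDK's integrations.
    sentry_sdk.capture_exception(exc)
```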
And as an engineer, the thing you really want to do is just write more code and sort of move forward with the project, but you don't want to constantly be shipping bugs to your customer. So, how do you ship this new update, but minimize the impact? And that comes down to automated testing and systems around that. To make this scalable, again, automation. We need to minimize the humans involved. We don't actually do any manual QA at Sentry. I don't think we're nearly big enough for that to matter. But we've also spent a lot of time on automation, so we don't really need those kinds of processes. And then the last one, which is sort of a challenge: if you look at Mac OS today, whenever it updates, there's nothing really exciting about it. But if you look back at Windows, when it went from Windows 98 to 2000, or 2000 to-- I think there was something in between that wasn't great-- but Windows 2000 to Windows 7 was a massive leap in technology, and product, and everything. And it was this really big splash. And you don't really have that anymore, because you're doing these smaller changes. You can still do this, but it's a little bit trickier and you have to invest in technology. I'm not actually going to cover that in this presentation. If you're curious, I can talk through how you go about it. But it often isn't a very important thing anymore. So this is our objective. These are the soft requirements, if you will. And the biggest thing here is this is going to be riskier than doing that large release cycle, but we can minimize all those risks. To give you the best example of this, Sentry's core technology basically captures errors. And it doesn't really stop you from creating bugs, but what it does is tell you about them within seconds. And if you have a good process like this, you can actually go fix the issue and ship the update. So you've limited the scope of the impact on your customers. And there's a lot of other ways you can minimize that risk as well. But you just have to accept that it's going to be there. And this is a principle we think a lot about. I was trying to find this quote, and I couldn't dig it up. But I read something from, I think, Paul Buchheit from Y Combinator, that was something about focusing on what 90% of your customers need and ignoring the 10%. You can sort of treat software in the same way. And we can take the risk in the same way. What's going to be 90% safe? And maybe there's this 10% risk that it's going to cause a problem, but we haven't fully vetted that. If we can minimize that, and be able to react quickly, that's actually a totally acceptable risk. And we sort of apply that rule on everything. What's the 90% good enough feature? What's the 90% good enough technical solution to this problem? OK, so, the four key components of this process-- and you can probably distill this down in different ways-- but these are what we see as the major principles behind this. So, integration we talked about. This often takes the form of manual QA or automated QA. And it can happen at different stages, but in this stage we're talking about before we've shipped it to any kind of customer. Deployment. It's very important that this is repeatable. If you've played with Heroku-- which, I was talking to David, and he mentioned you might in this class-- things like Heroku make it so you have a change set and your build, what you're shipping to customers, is always the same for that change set.
So if you go back, and you ship an older version of your product, it's still the same old version that it was, even though you're shipping it six months later. Monitoring. There's a lot of stuff in here. We like to focus on what matters to application developers and not systems. So, in Sentry we have a small team of operations engineers. And they're responsible for just maintaining servers, at the end of the day. And they monitor things like, do we have enough disk space? Is everything online? Is a machine overloaded on CPU? But when we're developing product, we don't really care about that, and we shouldn't have to. I'll talk a lot about tools. And there's a lot of systems that are trying to take that off of our hands, which is great, because it allows us to focus on more relevant things. So, whatever our goals are, versus just reimplementing everything under the sun. And lastly is the feedback phase. And the analogy I was using before, Twitter, is a feedback mechanism. It's not a great one, but it's one where your customers will give you feedback, whether it's, we love your product, or this is totally broken. The big thing is-- and this is the gray area in technology these days, Sentry fits in this world-- there's not a lot that is active feedback, where it's automated. It's automated user feedback, in a way. So we'll talk about some of those, too. But it's the last key important piece of this. And, OK, good. I don't actually remember what order my slides are in. But this is how we think about the workflow, and what it looks like in any major technology company today. So, you're familiar with a lot of this stuff. But you're going to make some changes to the code. It's going to go through a testing phase. You're going to deploy it. And this phase actually might repeat. So you might go through a testing phase, deploy it to 10% of your audience. And then there's another automated phase that checks if it can be deployed to a larger set of your audience. Those are often very expensive processes, so you only find them at like Facebook and Google, and companies that have a lot of engineering time to invest in it. But then we go through what we consider sort of this automated monitoring QA phase, which is where things like Sentry, or any other monitoring, uptime monitoring, might come into play. What you do with that information is up to you. But in the ideal world, you create a triage and resolution phase, which actually might encompass this whole thing, where you're going back, fixing issues. And eventually we get to the success phase. And the success phase is also where we think about feedback. So, a good example of this is if you sign up for Sentry, and you hit an error. We track this in the system, and we have an internal SLA where we say we try to deal with that error in the same day. And as soon as we deal with that error, we go through this phase of pushing out a fix for it. And then, when we hit the success phase, what we actually do is reach out to any customer who was affected, because our systems actually track them all. And we're like, hey, you know this error you hit that you didn't tell us about? We fixed it. And this is kind of the newer train of thought that doesn't exist a lot in software. Often the only time you're going to talk to anybody from a company is if you proactively reach out. OK, so, we're going to dive into each of those phases a little bit. My goal with each of these is not so much implementation, but I want to cover high-level what we look for.
And I want to talk about some of the tools that you can use. A big part of what we think about in the industry now is just, how do we do less work? So, we don't want to run servers. We don't want to build our own monitoring. We want to use the best solutions for each thing and piece them together. So that's what we're going to go over here. The first one is integration. And this is arguably the most important thing. It's very hard, it's a very rigorous thing to get into play, doing automated testing and writing very high quality tests. I can tell you, I've been doing software engineering for 15 years, and I don't think I've ever been in a company where this is a really solved problem. Even at Sentry, we do a lot of testing, and we still ship bugs every single day to production. But what this comes down to is a change control process. And, if you do an internship at a big company, you're going to be forced to deal with this. Startups often don't have a lot of this. But it comes down to something like, you're going to go on GitHub, you're going to propose a change through a pull request. You'll often have peer review of that change. That review kind of varies. Sometimes it's technical feedback. It's like, no, you should actually write the code this way. But oftentimes, it's design feedback. Or it's like, hey, did you think about this case? And that's really actually what we care about, because in parallel, often you have this automated verification phase. And that's actually going to test, is the code good? Is it valid? Is it correct? And this is honestly where we spend the most investment from engineering resources. And then the last one, which is sort of a culmination of all this automation, is can we ship this? And that's what they call integration. But we just call it "can we merge it?", which comes from a Git term, but it's like, can this go to production at this point? So, a little bit on each of those. Pretty much everybody, I think, is probably familiar with GitHub at this point. It's sort of the canonical solution for just doing any kind of code management. But when you are at Sentry, or many companies, what you are required to do is create a change request just like this. Outline what's going on. This is actually a change to Sentry. Our whole thing is open source, so you can actually dig this up if you want to. But it's going to go through this review phase, where often we're outlining what the design is for this. Sometimes it's as short as just a line, but oftentimes it's more complex and we go into details. And then we actually mandate that somebody reviews your code. The review may not provide any value at all, but it's one of those checks and balances that's kind of a lightweight cost that lets us continue to iterate quickly. From there, there's a lot of automation that's built on top of this. And the great thing about GitHub is it works with a lot of these other technologies, these open source tools. I'll talk about some of those. I mentioned peer review already. But we sort of lump these into verification. And Sentry actually relies on a lot of these. So, I won't talk about each of these specifically in here. I'm happy to dig into them. But we do something that does visual testing. So it actually renders pages of our website, and we'll know if there's a difference. We do things that say, "Are there enough tests covering this code?" For example, if you were to change our billing code, we actually require that every line of code is run by tests.
We have our standard, did the tests pass? We have another set of visual testing in here. And then we actually have a security test as well, which just asks, are there vulnerabilities introduced, or are there dependencies being added here that are a concern? And every single one of these is provided by a third party, and they all integrate with GitHub very seamlessly. Most of these you can also use for free or dirt cheap, especially if you're just working on a side product that's very low volume. Sort of what we think about in the tooling chain-- the big thing here is, it's infrastructure that's going to run whatever you tell it. So there's a lot of services that I showed back there, but those are just running on top of these systems. So, what we strongly suggest is exploring GitHub and Travis CI. They pair together really, really nicely. The learning curve is not huge. But it's important to know there are a lot of tools in this industry. And this is, honestly, just the four I can remember off the top of my head. There's probably like 40 of them that just run, say, your tests. Some of these are going to be things you have to run yourself. Jenkins is an older Java technology that I definitely do not recommend you try to set up. But it's tried and true. And a lot of companies use it. Whereas things like Codeship, and Circle, and Travis, and GitLab are newer, often SaaS services that try to be drop in, where you don't have to learn a lot and you can just wire this up to your projects. But yeah. Generally, explore this area. I truly think this is the most important thing you can commit yourself to in software engineering: getting over this hurdle of, we have to spend time on this integration phase. And actually, a quick aside, I joined Disqus about, I don't know, seven or eight years ago, and there were like 10 people on the team. And at the time I think we had something on the order of like 100 million page views. So when somebody is loading something through our network-- I forget if that was a day-- but it was a significant amount. And the first week I'm there, I took down everything. I shipped a bug. And I shipped a bug because they had none of this in place. They had some automated tests that just never ran, so you had to run them manually. They didn't really have a lot of monitoring. I knew about these concepts, but I never really grasped how important they could be. Because you can go in and be like, "Oh yeah, I tested my code." And that's easy if you have a very small app, but as soon as your change is affecting a large, complex application, you're just not going to know what's happening. So a lot of what we build here is future-proof tests. I'm often not writing tests to say the code I'm building right now is correct. What I'm doing is writing tests so when somebody changes that code, or changes how it interacts, those tests will fail and prevent somebody from causing some other significant issue. So this is a very, very important thing. There's a lot of technology out there, thankfully. But it is still very manual. And the unfortunate part about at least this area of technology is it changes a lot. I think even at Sentry, we've gone through like four iterations in three years of what is a good system for these kinds of things. This is just an example of how you might configure Travis. There's a bunch of links, and I think you all have access to the slides afterwards. But this is a truncated version of one of our configurations.
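As a rough sketch, a truncated .travis.yml along the lines he describes might look something like this; it is illustrative, not Sentry's actual configuration, and the services and commands are assumptions.

```yaml
# Illustrative Travis CI config: pick a language, start the backing services
# the tests need, install dependencies, run the test suite.
language: python
python:
  - "2.7"
services:
  - postgresql
  - redis-server
install:
  - pip install -r requirements.txt
script:
  - pytest
```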
But you basically just say, OK, I'm going to run this language, which is Python in this case, I need a couple of database servers running, and here's how I'm going to install my dependencies. And here's how I'm going to run tests. And you don't have to run tests, you can do whatever you want. But in this case we're just using a standard testing framework. The way we resolve dependencies is just the standard way. So there's nothing really out of the norm. So it's often very easy to get started. The next big thing is deployment. And this is, unfortunately, the area that there's not a lot of technology around. But let's talk about what we look for here. So number one is we want to offload control. So, if you go back even five years ago, and even today at a lot of companies, there is what's called a release manager. And this person is often designated as the person that presses the deploy button, or ships code. And I remember, I think even at Dropbox, there was a daily check-in. And so, these changes are all going to go live today. Are you here to sign off on your change going live? And then somebody hits the button. And the person is often a system admin, or an operations engineer. And the idea of continuous deployment is that goes away. We want you, the author of the change, to be able to ship your change, and be fully responsible and accountable for your change. That's like the distillation of this idea. But really, what it actually looks like in practice is, oh, you're the team that's responsible for the API. You should be able to run the API yourselves. We shouldn't have to manage that as a larger company. So when you look at big companies now, what they're doing is they're building teams that all they do is build this platform layer. And the platform layer is intended to let you run your own tests. It's intended to let you deploy your own code, monitor your own code. And they no longer have to act as the source of truth for whether our thing's good. And the good thing about this is that it was always sort of a thing at the biggest companies in the world-- Google has probably had this for 15 years-- but now even the tiniest companies have this kind of capability. So giving ownership is important. Repeatable builds are really important. And this is just sort of a thing to think about if you ever go down this path, because you'll often find that when you're building software now, you're using a whole mess of dependencies. And those dependencies are not very well controlled. And so you'll have something like, a version changes, or a dependency's dependency changes. And if you were to, say, rebuild the same version of your app that uses those dependencies, and something changed under the hood, all of a sudden your app just might not function. And there was actually a very famous version of this-- I don't remember when this was. Like a year, maybe two ago. It's called left-pad if you Google for it. But somebody unpublished a dependency that was used by everything on the internet, and everybody's builds broke. It actually prevented a lot of companies from being able to deploy software for a day or something. So it's a very important thing that, until it bites you, you don't really care about. The next one is very hard, I would say. Rollout strategies often don't exist at small companies. We don't really have one.
The simplest version of a rollout strategy says, I'm going to deploy this to a staging environment first, and I'm going to verify it's working there, whether I manually visit it, or I automatically check it. That's kind of an OK version of it, but it doesn't really scale and it's not all that useful. So we have that, but we don't use it. We don't find value in it. But going back to Dropbox or Facebook, you're going to go in and you're going to say, OK, my change has been released today. But actually what's happened is Facebook employees are the only people that see my change. And they're using my change for a week before any customers see my change. And then maybe in a week it only goes to 1% of customers, or something. And in another week maybe it goes to 10%. So that's what I mean by a rollout strategy. And what that does, again, is it minimizes risk, and then you can often back out from that. So especially if you have a big enough company, if it's your employees that are hitting the bug, it's actually not really a big deal. They're invested in this. But you can't really do that at a small organization. So there is a lot of technology that's being built to make this a little bit more approachable. But I think it's much too complex of a subject, and it's not something you're ever going to want to build yourself, so I'm only going to lightly touch on it. And then the last thing-- and this sort of circles back to our change control process-- is not allowing anybody to circumvent integration. So, for a lot of companies, including ourselves, we have compliance regulation that says you cannot deploy any changes that circumvent the change control process. And basically all that says is you cannot deploy a change to production that has not passed all the tests, and has not passed whatever those verification techniques you had were. And that's actually a pretty easy thing to get in place, but it does require rigor, because you're like, the test is failing because, I don't know, it's a leap year, or something like that. Or the test is failing because it's just flaky. But in our world, we actually just have to accept that, and we say it doesn't matter. No matter what, we have to say, this blocks unless it's green. And so it ends up consuming more time, but overall it helps with general happiness. So this is kind of our agenda. The first thing I want to talk about is control. So, I was actually talking to David about this earlier. He mentioned that he talks a lot about Heroku and stuff, which I think is a great platform that gives control. You don't have to learn a lot, you can just deploy projects on it. This week was Hack Week at Sentry, and we actually used Firebase, which is owned by Google, to do a lot of prototypes. And it's another one of those kinds of systems, it just provides even more for you. So in this case, I'm just using the Firebase command line to deploy this Hack Week project, and it's no different than Heroku. I'm able to create a project, spin it up. And in this case, I'm actually just deploying a JavaScript application with nothing else. But it's important that it's just giving that to me, that I can do it. I don't have to ask somebody else. Somebody else isn't getting a phone call in the middle of the night when I need to ship a bug fix. It's really all on me at the end of the day. And now, this works for a side project. But as soon as you are inside of a company, you don't really want anybody doing this.
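One way to picture the kind of control he turns to next, and the "this blocks unless it's green" rule from the change control discussion, is a small deploy wrapper that refuses to ship any commit whose CI status hasn't passed. This is a hypothetical sketch: the repository name, token handling, and deploy script are placeholders, and GitHub's combined commit status API is used only as an example of where such a check could read from.

```python
# Hypothetical deploy gate: refuse to deploy a commit unless CI is green.
# Repository, token, and deploy command are placeholders.
import os
import subprocess

import requests

REPO = "getsentry/example-app"  # placeholder


def ci_is_green(sha: str) -> bool:
    # GitHub's combined status endpoint aggregates checks like Travis CI.
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/commits/{sha}/status",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["state"] == "success"


def deploy(sha: str) -> None:
    if not ci_is_green(sha):
        raise SystemExit(f"{sha} has not passed integration; refusing to deploy")
    # Stand-in for the real deploy step (Heroku, Firebase, an internal tool...).
    subprocess.run(["./scripts/deploy.sh", sha], check=True)


if __name__ == "__main__":
    deploy(os.environ["GIT_SHA"])
```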
You want to have this controlled through some mechanism that ensures it's safe to deploy, because-- one thing you'll note here, I'm actually not doing any kind of proper change control. This is me just saying, whatever's in master, I'm going to deploy it. I'm not saying, this build is green. It's passed tests. I don't even know if I have tests on this. And so, it doesn't really fit all of our requirements. So, what you'll often find, and unfortunately this is one of those areas that there's not a lot of technology around, is a company will have their own system for controlling how these deploys go out. The old school way is it's owned by a person and they press the button. But in Sentry's case, we actually built something where anybody in the company-- I think even if you're an accountant, or an office manager of the company-- I think you could technically go in here and deploy Sentry. But you just have every service in here. You can deploy it, even if it's not your service, which isn't great, but it's good enough. This is actually open source, but I wouldn't worry about it. I think Heroku has a little bit of this built in. But again, at the end of the day, this is often company specific, because the way you ship code to production is often different between each company. Just because it's often very coupled to how your servers function. The rollouts component. So again, I mentioned, there's not a lot of great technology here. I do want to highlight one piece. I only recommend exploring this if you're very, very into operations and systems. But there's a technology called Kubernetes. It's very interesting, very compelling around this. It makes it very easy to stage rollouts. So you could say, "I'm going to quickly spin up 10 copies of the new version of the app, and as soon as those are working well I'm going to tear down the 10 old copies, and I'm just going to cut everything over." We call that a blue-green deploy. But Kubernetes is a nice technology that enables that. And it's very accessible on platforms like Google, their cloud platform. But it is a little bit tricky. As somebody who's been doing this for a very long time, I've never been a classic systems admin, but I sort of know my way around. And I sort of have this side project, and it took me far too long, like embarrassingly long, to figure out how to make anything work on this platform. But it is very good. AUDIENCE: It's the same thing with Docker Compose. DAVID CRAMER: It is. It's very complex, but it's got a lot of good ideas. But it's often overwhelming. And Kubernetes is all built on top of Docker, so it's good for, like, connecting the dots together. You know, again, going into the ecosystem, there's kind of two ways we look at this. One, there's an abstraction of machines. We call that infrastructure as a service. It's AWS, it's Google, it's Azure. They're all good. It's, like, pick your poison here. We actually use Google for everything. We just transitioned all of our hardware there. I will say I'm very, very happy. Without picking sides, I think Google is the best. And it's a lot more accessible, which is nice. Amazon, if you've ever tried to understand what all their products do, it'll probably quickly turn you off, because there's just too much going on. And it's not very user friendly, whereas Google doesn't really give you that much.
They have servers, they have cloud storage, they have the whole Firebase platform, but there's not like 10 services where you don't really understand what the name of the service translates to. So it's kind of a nice middle ground. I haven't personally used Azure. I'm sure it's also good, but it often caters to a different kind of technology. But then we go back up to what we consider more of a platform as a service, which owns more of this. A lot of these are honestly built on top of AWS, or something else. Firebase, I think, is great, but it's probably better if you're trying to build a JavaScript application. Heroku is really good, but it doesn't do a lot for you anymore. Zeit is just a newer one that we know the people behind. And it's another interesting JavaScript platform, where, if you build your application this specific way, you don't have to worry about any of the infrastructure. And that's a really nice thing to get around. OK. So, the third step is observability-- monitoring, visibility, whatever you want to call this phase of it. But it's knowing when things go wrong. So, again, the goal is we do this from day one. Ideally, it doesn't take us much time, but we start with monitoring built in, so we don't have to add it when there's a problem later. Honestly, if that's all you take away from this, just drop in some third party service to whatever you're shipping out there, it's the best thing. There's a few important things that we care about. Like, it needs to be real time. And when we say real time, we need to know about something-- ideally it's even better than a minute resolution. We want to know within like 10 seconds. And we want to know within seconds, because if we can know within seconds, we can automate it. So, for example, we could deploy a change to production, and Sentry could tell us that there's an increase in errors. I'm not saying we do this, but we could. And then we can say, OK, that's enough data that we're going to either roll back that change or promote the change to the next level. And that's why the real time matters. Graphs and logs are kind of a classical solution to things. And they're very useful, but they often don't tell you what's wrong. They tell you a lot of symptoms, but they don't go into what the root cause is. And the way we look at technology now, and the way we build Sentry is, we say, what is it that we actually are trying to fix? If we're going to go in and address a problem, what is the actual problem we're going to need to identify? And so for Sentry, we say, let's identify that problem and tell people about it, rather than giving them graphs that say, like, you're running out of memory, or there's a lot of errors, or something along those lines. But there are some good systems out there that will take all this data and look for anomalies. That can often help with more complex problems. And then the last, and I think the most important, change that's happened over the years here is that there's a drastic split between what we consider application monitoring and systems monitoring. And systems monitoring is what has always existed, and that's really where the graphs and logs come in. Application monitoring is what we consider Sentry. If you've ever looked at New Relic, a lot of it's app-level monitoring. It tells you about what's wrong with your code, not sort of what it looks like on the system level.
And this is where you really get deep insight, because often these systems are living inside of the code. So they can actually see full source code, they can know-- for example, Sentry knows who the user is that's acting on a request. In a lot of cases, in Python, for example, we can tell you what variables are assigned, and things like that. And all of a sudden, we have a lot more context to dig into an issue, and you'll never get that with systems monitoring. My goal with this was to sort of overwhelm you with a lot of options. There's way more than this. I tried to eliminate any that I don't think are interesting anymore. You kind of have two categories. There's a chunk that are open source, which often means run it yourself. And there's a chunk that are cloud services. And they often still have a free tier, so there's enough to get started. Personally, we only use Datadog in this list, I believe, and Sentry, of course. I guess we use Elastic for some things. But all of these sort of solve different constraints. If you're just building a hobby project or a small thing, you're not going to need most of this. But if you truly need an encompassing solution that covers all the bases, it's going to be, like, pick five of these, or something. A few quick highlights. Prometheus and Datadog are both systems monitoring. They provide a lot of graphs and things like that. Datadog, I think, is actually very, very compelling. It's a very cool tool. Primarily you can do things like, I want to send a metric that says, here's my response time. But then I want to say, the response time is actually for this endpoint. And I can actually see overall response time. I can see the top 10 slowest endpoints, things like that. So it's like a nice spin on the classic systems monitoring. Some of these are what we would call APM, which I believe is application performance monitoring. But it's focused on performance at the end of the day. And that's what New Relic is. It's kind of what Zipkin is. It's a gray area on some of these others. Scout and Datadog both provide some of that, but it's really intended to say, my code is slow, why? Or, my network is slow, why? Which is an important problem once you have different services communicating with each other. A few of these are logs. Papertrail is something we used to use. I think they have a free tier, but it's a very cheap, cloud-based log solution, where you can literally just view the stream of your logs instead of having to jump through a bunch of hurdles to see what's going on. Elastic is another version of that that's a little bit more complex. Stackdriver is any number of things. I think they do a little bit of what Sentry does. I think they do a little bit of what Zipkin does, a little bit of what New Relic does. But you'll often find, if something tries to do a lot of things, it doesn't do any of them very well. I'm not saying it's bad, I'm just saying that's the caveat you get with these. OpenTSDB is similar to Datadog as well. There's a mix of things here, I guess, is the important point-- some are systems, some are application, but very few of these focus on your application. I would say it's Sentry, it's New Relic, and then some of these have a couple of features that blend in. The main thing here is just pick a service that solves the problem the best, and just use that service. Don't worry about it if you need three different ones.
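As a sketch of the tagged-metric idea he describes for Datadog, here is what reporting a response time sliced by endpoint might look like with the DogStatsD client from the datadog Python package; the metric name, tag, and agent setup are assumptions, not anything shown in the talk.

```python
import time

from datadog import initialize, statsd

# Assumes a local Datadog agent listening for DogStatsD packets.
initialize(statsd_host="127.0.0.1", statsd_port=8125)


def handle(endpoint: str) -> None:
    time.sleep(0.01)  # stand-in for real request handling


def timed_request(endpoint: str) -> None:
    start = time.time()
    handle(endpoint)
    elapsed_ms = (time.time() - start) * 1000
    # One metric, tagged by endpoint, so the dashboard can show overall
    # response time or the top 10 slowest endpoints.
    statsd.histogram("app.response_time", elapsed_ms, tags=[f"endpoint:{endpoint}"])


timed_request("/api/0/projects/")
```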
We often have a lot of people that come to us, and they're like, well, we want to use Sentry. But if we use Sentry, do we still need New Relic? And it's like, yes. If we use Sentry, do we still need Elastic or logging? It's like, yes. It doesn't solve the same problems. It solves newer problems that we've created for ourselves. The other important thing-- root cause analysis. So this is actually a slide, I forget which. We put this together to describe what Sentry could be in the near-term future. You have a complex application no matter what these days, where at the very least, you have a website, or a mobile app, and then you have a server component that's communicating with it. And so now, all of a sudden, you have a distributed application. In the simplistic form, I'm loading my mobile app. I click a button. It hits an API service. The API service errors. In the mobile app I see an error, in the API service I see an error, but is that enough context on its own? It often is, but it might help if I'm looking at the API call and actually know what happened in the mobile app. So we think of this as tracing, but it's high level. And this is, again, a very simplistic version of this. At larger services these days, this could be a hundred different things that are communicating with each other. But the main thing here is, what is actually the cause? Clicking a button in the mobile app might be the cause. The cause might be deep in the API. But we really want to distill down, whatever the error we saw, what caused it? And really we want to know, what's the code that caused it? And when did it start happening? And when did the change set get introduced? OK. The last major thing: feedback. You can take this any way you think about it. Customer support, monitoring, it sort of all blends together in here. This is the area that I don't think there's been a lot of development in. Customer support still works the same way it did probably 30 years ago. You email somebody, they email you back. It's probably a very bad process most of the time. But there's been a little bit of progress. Sentry does a little bit in this world. But let's ask again, what are we looking for? So, we think of this as active and reactive. Active being, we want to know what's going on before somebody tells us about it. So that goes into the monitoring world. And reactive being, somebody sent us an email, or somebody complained on Twitter. It's sort of one of those things where, if it's gotten to that stage, it's not ideal. I remember when I was 15, there was some story about customer service that I'm just going to assume is fact, where it's like 9 out of 10 customers won't complain. And if that's any kind of statistic, that could be like 90 out of 100. Could be 99 out of 100 for all we know. And so if you're relying on that kind of feedback, it's not really good. And that's why Sentry focused on error monitoring so much, because errors are often the first sign of a real customer problem. So on the active side, we want arguably everything we can automated, just like this whole process. And reactive, really what we want is not somebody raging on Twitter about their United flight. We want them directly talking to United. So, you want to ask people for it. Whenever you have an opportunity-- and actually I'll show you some examples of this-- you want to engage them, like, will you give me feedback? It's not really different than anything else in life. So Sentry has kind of grown very fast.
And we're doing our first peer reviews in the company. And really it's just, say, give me feedback. You're probably not going to give me the feedback day to day, but I am asking you to give me feedback so I can be better. It's the same thing when you're doing engineering. Another thing-- and this is something I think people ignored for a long time-- is context is very important. And that's why a lot of this automation helps, because if we can automate collecting this feedback, we can automate collecting context. And the classic example is, a customer writes in about a bug in your application, and you're like, what browser are you using, or what operating system are you using? And that's such a simple question. Even if you just have a form that says, tell me the browser and operating system. But really, you don't need the form anymore. We have enough technology. It's really easy to collect this data. And so that's one thing Sentry does very well. We just collect all the data we possibly can. And I'll call out here that you probably actually don't need to do a lot here. If you're a big business, this obviously matters a lot. If it's a hobby project, who's going to be mad at you? Your users? They probably don't pay you any money. If they do, you can do a little bit here. I would argue something like Sentry is all you really need to get off the ground. I was talking to David earlier and we were talking a little bit about what we do day to day to monitor our systems. And my analogy was, I haven't had to look at any tool besides Sentry-- and I don't do development every day anymore-- but I haven't had to look at any other tool for years at this point. I don't have to look at system logs anywhere. I don't really look at graphs anymore. Sentry solves the major concerns I have, and those are the concerns of the end user. If the end user is not seeing a problem, does it really exist? So, going to the reactive feedback. I think this is the key thing. So, this is actually a Sentry feature here, but you've had a browser crash or a mobile app crash. You open it back up and it's like, hey, it crashed, do you want to let us know anything about it? Now, the truth is, you don't need any of that information. You already have the details from that crash report, for example. The reason we do it is because we feel there's a stronger emotional connection-- like a business connection-- that if the user feels like somebody is aware of that problem, they're probably going to be far less angry about that poor experience. But this is no different than a customer support form. Or, imagine your flight gets delayed and you're really mad, but when you're notified that your flight is delayed, the airline's like, hey, we're sorry about it. Can we help you with something? Instead of just telling you the flight's delayed. You want to ask for that feedback. Again, if you can automate it, that's great. In this case, we've automatically sent a crash report, which is just like error logging. And if they add this information, we'll just associate it with that report. So we've still automatically collected that feedback, and we've automatically enriched it with context, but this is just taking it another level. Another interesting way you'll get a lot of this is if you're doing any open source work. And I highly recommend just open sourcing everything, unless you have a reason to keep it private, especially if it's a side project. Put it behind a protected license. It doesn't really matter.
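Going back to the automatic context collection he mentioned a moment ago, here is a rough sketch of what attaching that context up front can look like with Sentry's Python SDK: record the user and a trail of breadcrumbs so nobody has to ask "what browser, what OS, which user?" after the fact. The DSN and values are made up, and client SDKs collect much of this (browser, operating system) automatically.

```python
import sentry_sdk

sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

# Who hit the problem, so affected customers can be followed up with later.
sentry_sdk.set_user({"id": "1234", "email": "user@example.com"})
sentry_sdk.set_tag("plan", "business")

# Breadcrumbs: the trail of events leading up to an error.
sentry_sdk.add_breadcrumb(category="navigation", message="opened billing page")
sentry_sdk.add_breadcrumb(category="ui", message="clicked 'update card'")

try:
    raise RuntimeError("billing failure")  # stand-in for a real error
except RuntimeError as exc:
    sentry_sdk.capture_exception(exc)
```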
But we actually get a lot of people, some good, some bad, submitting feedback via our GitHub issue trackers. And this is actually one that was-- I mean, I took this screenshot yesterday, so it was submitted like this week, effectively-- and it's just a problem they had with the SDK. And they probably pay us money. And they're still willing to open this GitHub issue, which is really, really nice. And it's just another channel that we make accessible to allow them to give us feedback. Again, as often as possible, how can we make the experience good for them, when usually in these situations it's really bad? This is part of a Sentry crash report. We have the basic things I talked about. We know the browser, we know the operating system. In that case we know the user, which is actually super important for a lot of things. This might be what a log message looks like. It's not really useful. We just log when there's a billing failure, because often this can be caused by something that won't actually throw an error, but it is still an error. But we actually do a bunch more in this case. We're actually collecting a bunch of events that led up to this error. And I will argue that most of these have not actually been valuable-- we don't need them to solve the problem. But every so often we're like, oh, this is how it happened, or, oh, this is how we triggered that problem. And it sort of lets us accelerate that process of resolving the issue and, again, going back to the customer. Again, this is the only goal, I think, with software, ever. Some of you, I'm sure, will actually go into technology and build useful things in the world, but most of us are going to build registration forms. That's what I always joke about at companies. But the reality is, a lot of the problems are not very, very hard. And a lot of it's just being able to iterate very quickly, and get through the process efficiently. And this is actually an idea we proposed to one of our investors at some point. You have a user that hits an issue, Sentry's going to notify you via Slack or whatever your poison is. We're going to fix it. And then, what if we just email the customer automatically that the bug was fixed? We don't do that because there's like some concerns around it, but it's an interesting idea. Any human that was doing those processes could be doing something else instead. One way we think about this is, you often have large customer support teams at companies, and that customer support is reactive feedback. But you also have teams that you would describe as what they call customer success. And that's more active. It's like account management. And so what if every customer support person could be doing active account management, instead of responding to bugs all the time? It feels like we'd all be a little bit more successful. There's a lot of tools in this space as well. There's different approaches here, so that's the main thing. I would say these tools are not very well related to each other. So Zendesk is a big support application. You write an email, it goes into a ticket tracker, basically. It looks like GitHub issues. FullStory actually does visual, sort of, what's your user doing on your website. So, maybe you're releasing a feature, and you're not sure why nobody's using it. FullStory might be able to paint you a picture of what the users are actually doing. They're going from this page, to this page, and they're actually not even scrolling down like this far on the page.
But it's really designed to give you a better visual representation of what's going on. And these are often harder areas to automate. You have to give them business logic to say, well, what I care about is this goal. And then, why it's not being hit is very subjective. So you'll find in the feedback landscape, and again this blends with monitoring, some of this is going to be hands on, and maybe one day we'll solve some of that. But there's still a lot of work to do there. Sentry, fundamentally, all we do is tell you when your code's broken. There's a little bit outside of that, but that's at the core of it. Intercom is kind of a blend of Zendesk. But it also lets you do proactive messaging. So Intercom's almost like a chat widget on a website. But it does a few things with that kind of functionality. But again, it garners reactive feedback. So it's always there, somebody can just click it, and they can type something in right away. But you can also do actively requested feedback in there. So you can say, I want to send a message to everybody. I want to tell them about this new product, or this new feature of my product. And it'll blast anybody that loads your application. And then, again, you can engage more and get more feedback through that mechanism. And then the inevitable: somebody is going to email you and be mad, or tweet, or Facebook, or whatever. I don't know, do they Snapchat it these days? I'm not really sure. If not yet, soon, I'm sure. But there's a lot of different ways you can collect this feedback. I think the big thing that hasn't happened in the industry, like I was saying, is there needs to be more automation around this. Even Sentry's automation around feedback is just, it's just a little bit. We've just barely touched this kind of market. OK. The last little bit, and I'm not going to dive into this, because I don't want to be super technical, but if you're really looking to see how something simple could be wired up, this is actually an application-- I think it took me two days last week. All it's plugged into is Firebase. I actually didn't even have Travis configured the other day. I think I did that after I made this slide. And it's connected to Sentry, so we can know if there's any kind of obvious thing that went wrong. And what it is-- and I actually don't have the network working on this computer right now-- but we're doing Hack Week at Sentry right now, which is no different than a hackathon. It's basically, we say anybody that's in our engineering organization can do whatever the hell they want for-- unfortunately only four days this week-- but for the entire week. And we wanted a product registry. So we're like, OK, what's the quickest way we could get something out the door so we could start collecting this data? So we explored and used Firebase. And you can actually load this up if you want. I'm pretty sure I made it open source today. If not, I can fix that. But it's just using a simple Firebase app. It's using boilerplate technology. It's using React, which, if you're not familiar, is a very good mechanism for building a very responsive, single-page application. It's using Firebase's database on the back end, which is actually, like, fully client connected. So, when you load this application, it's actually running entirely in your browser. There's no server that it's communicating with, other than Firebase. So it's kind of a neat way to prototype something. But it's also wired up to Travis.
You can sort of see how the lifecycle would work in there. There's probably next to nothing in the README, of course. But if you're interested, explore it. There's a lot of other stuff that sort of conveys some of these ideas on our GitHub. I mentioned one of the side projects: we have a project called Zeus on GitHub, which is sort of an implementation of reporting on top of continuous integration. But again, all open source. But yes, I think that is it.