Production Oriented Development

2019-12-30

Throughout my career, I’ve developed some opinions. Some have worn particularly deep ruts, reinforced by years of experience. I tried to figure out what these had in common, and it’s the idea that code in production is the only code that matters. Staging doesn’t matter, code on your laptop doesn’t matter, QA doesn’t matter, only production matters. Everything else is debt.

This perspective probably comes from years sitting in between operations and product development. I strongly believe that teams should optimize for two things: getting code to production as quickly as possible, and responding to incidents in production.

This idea, and a lot of the practices it implies, can be counter-intuitive or controversial, so I want to dive into them a little further. What follows is a set of practices and principles I believe are true, considering my underlying belief that code working in production is the only code that matters.

1. Engineers Should Operate Their Code

Engineers are the subject matter experts for the code they write and should be responsible for operating it in production. In this context, “operating” means deploying, instrumenting, and monitoring code as well as helping to resolve incidents related to or impacting that code. The responsibility of operating code aligns incentives - it encourages engineers to write code that is observable and easy to debug, and connects them to what customers really care about. It encourages them to be curious about how their code is performing in production. Importantly, engineers should be on-call for their code - being on-call creates a positive feedback loop and makes it easier to know if their efforts in writing production-ready code are paying off. I’ve heard people complain about the prospect of being on-call, so I’ll just ask this: if you’re not on-call for your code, who is?

If you’re not currently on-call for your code but want to be, and can help influence this decision, there are some things you can do. Set up PagerDuty (or similar) schedules for each group of engineers responsible for specific services or parts of your code. A good schedule has 6–8 engineers. There are plenty of variations, but a typical template is to have one-week rotations, where you’ll be on-call as secondary for a week and then as primary for a week. Configuring alerts is a separate topic, which probably deserves its own blog post entirely, but focus on things that impact your customers (see: Symptom-based alerting) and remember that you’re ultimately responsible for how you respond to alerts, which means you can change them.
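To make the rotation template concrete, here is a minimal sketch in Python of a weekly primary/secondary schedule. The function name, the team names and the start date are made up for illustration, and a tool like PagerDuty would normally manage this for you; the point is only to show how "secondary one week, primary the next" falls out of a simple round-robin.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary in any given week becomes the primary the week after.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


if __name__ == "__main__":
    team = ["ana", "ben", "chloe", "dev", "eli", "fay"]  # hypothetical names
    for week_start, primary, secondary in build_rotation(team, date(2020, 1, 6), 8):
        print(week_start, "primary:", primary, "secondary:", secondary)
```

With six engineers, each person is on-call (in either role) for two weeks out of every six, which is one reason a schedule of 6–8 people tends to be sustainable.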

There are two talks I’d recommend watching that touch on the topic of configuring alerts: Liz Fong-Jones talks about SLOs in Cultivating Production Excellence and Aditya Mukerjee does a great job talking about techniques for managing alerts in Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue.

2. Buy Almost Always Beats Build

If you can avoid building something, you should. Code is the most expensive way to solve a problem that isn’t a core area of your business. For most small to mid-sized companies, there are open source or, better yet, hosted solutions that solve a wide range of common problems. I mean things like Git repository hosting (GitHub, GitLab, Bitbucket, etc), observability tooling (Honeycomb, Lightstep, etc), managed databases (Amazon RDS, Confluent Kafka, etc), alerting (PagerDuty, OpsGenie, etc) and a whole host of other commodity technologies. This even applies to your infrastructure - if you can help it, don’t roll your own Kubernetes clusters (side note: do you even need to use Kubernetes?), and don’t roll your own load balancers if you can use Amazon ELB or ALBs.

Unfortunately, NIH syndrome is very real and some companies get burned badly by this. I’ve seen teams light time and money on fire reinventing components when better, more battle-tested alternatives exist in the market. Those same teams almost always end up spending years contending with the resulting technical debt. If you’re on such a team and have the will and ability to effect change, start rolling back these decisions one by one. Migrate your databases to a managed provider, migrate your feature flagging system to a SaaS tool (e.g. LaunchDarkly). Keep going until the only software you maintain yourselves is the software that delivers value to your customers. You’ll be much, much better off for it.

3. Make Deploys Easy

Deploying should be a frequent and unexciting activity. Engineers should be able to deploy with minimal manual steps and it should be easy to see if the deploy is successful (this requires instrumenting your code for observability, which - tada - is covered above), and it should be easy to roll back a deploy if something doesn’t go well. Deploying frequently implies that deploys are smaller, and smaller deploys are generally easier, faster and safer.
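The deploy, verify, roll back loop can be sketched in a few lines of Python. This is an illustration under assumptions, not a real pipeline: `release` and `is_healthy` are hypothetical hooks standing in for whatever your deploy tooling and observability stack actually provide.

```python
def deploy_with_rollback(new_version, current_version, release, is_healthy):
    """Release new_version, verify it, and roll back if verification fails.

    release(version) and is_healthy() are hypothetical hooks supplied by the
    caller. Returns the version that is left running.
    """
    release(new_version)
    if is_healthy():
        return new_version
    # Verification failed: put the previous version back.
    release(current_version)
    return current_version


if __name__ == "__main__":
    released = []
    running = deploy_with_rollback("v2", "v1", released.append, lambda: False)
    print("running:", running, "releases:", released)
```

The smaller the change being shipped, the cheaper both branches of this function are, which is the practical argument for deploying often.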

Many teams implement periods where deploys are forbidden - these can be referred to as code freezes, or deploy policies like “Don’t deploy on Fridays”. Having such blackout periods can lead to a pile-up of changes, which increases the overall risk of something going very wrong.

If you’re on a team that fears deploys, dedicate a percentage of your engineering time to improvements in your deployment pipeline until the fear is gone. On a recent team I worked with, we were able to improve deploy times from 3 hours to 30 minutes, which drastically improved the team’s confidence in the deploy process. A natural side effect of this was that engineers started deploying much more frequently instead of waiting for changes to pile up enough to warrant a “release” (which was synonymous with a deploy).

The book Accelerate has been getting a lot of attention. If you haven’t read it, I’d recommend it. The team behind it also publishes the State of DevOps reports, which are full of well-researched information about what various companies in the industry are doing. It’s not a coincidence that two of the four key metrics that the book focuses on are directly related to this (Deploy Frequency, Change Lead Time). Shipping is your company’s heartbeat.
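If you want to start tracking those two metrics yourself, a rough sketch in Python might look like the following. It assumes you can export a list of deploy timestamps and, for each change, a (committed_at, deployed_at) pair; the function names are made up here, and treating lead time as commit-to-deploy is my reading of the metric, so adjust it to whatever definition your team settles on.

```python
from datetime import datetime, timedelta
from statistics import median


def deploy_frequency(deploy_times, window_days):
    """Average number of deploys per day over a window of window_days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deploy_times) / window_days


def median_lead_time(changes):
    """Median time from commit to deploy for (committed_at, deployed_at) pairs."""
    return median(deployed - committed for committed, deployed in changes)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    changes = [
        (datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 10)),
        (datetime(2020, 1, 2, 9), datetime(2020, 1, 2, 12)),
        (datetime(2020, 1, 3, 9), datetime(2020, 1, 3, 11)),
    ]
    print(deploy_frequency([deployed for _, deployed in changes], window_days=3))
    print(median_lead_time(changes))
```

The median is used instead of the mean so that one change that sat unreleased for weeks does not hide the typical experience.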

4. Trust the People Closest to the Knives

The people who work with a system are the ones who understand it best. This applies to any part of the socio-technical systems within which we all work. In the case of software systems, the engineers who deploy every day and are on-call for critical services understand the level of risk they operate in. A sad trend is that managers tend to overestimate their teams’ progress on certain transitions - e.g. cloud-native, DevOps, etc. The higher up the management chain, the larger this overestimation tends to be. Engineers who deploy and get paged when things break know where the bodies are buried and they know what needs the most work. They should, therefore, be the primary stakeholders responsible for prioritizing technical work.

Another manifestation of this principle applies to platform or services teams. If you’re responsible for building some shared component that’s used within your organization (e.g. a messaging system, CI/CD infrastructure, shared libraries or services) there’s an uncomfortable truth lurking for you: in many cases, the people who use your work know more about it than you do. They understand implicitly how it serves customers and they know what contortions they have to go through to get it to work. Listen to them for clues on how to improve the UX of your services and tools.

5. QA Gates Make Quality Worse

Many teams have a manual QA step that gets performed before deploys. The idea, I guess, is to have someone run automated or manual tests to verify that a set of changes is ready to be released. This sounds like a comforting idea - having a human being (or team of human beings) “verify” a release before it goes out - but it falls victim to several false assumptions and creates some misalignments that do more harm than good.

First of all, if there’s manual work that needs to be done before a deploy can go out, that creates a bottleneck - if you’re making deploys easy, and deploying small changes frequently, no QA team is going to be able to keep up testing every deploy, and will inevitably block teams from deploying. That’s no good. If you have manual tests, automate them and build them into your CI pipeline (if they do deliver value).

Secondly, the teams doing QA often lack context and are under time pressure. They may end up testing “effects” instead of “intents”. For example, I’ve seen QA teams burn time testing that when something happens in a UI, something related happens in a database. What happens when an engineer refactors that UI component and changes the underlying data model? The functionality works, but the test breaks. Because two teams are involved, this takes coordination and time to fix. Similarly, I’ve seen QA teams block deploys because of failing tests when caching was introduced at the CDN layer - a TTL of 5 seconds on an activity feed may never be noticed by a user, but it might break QA tests, causing unnecessary conflicts between product and QA engineers.
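One way to test the intent in the caching example ("a new item shows up in the feed soon") rather than the effect ("it shows up instantly") is to poll with a deadline. The helper below is a minimal Python sketch; `eventually` and its parameters are names invented for this example, not part of any particular test framework.

```python
import time


def eventually(check, timeout=10.0, interval=0.5):
    """Poll check() until it returns a truthy value or timeout seconds pass.

    Returns True if check() succeeded before the deadline, False otherwise.
    check() is always called at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test for the activity feed could then assert something like `eventually(lambda: post_id in fetch_feed(), timeout=10.0)`, where `fetch_feed` and `post_id` are hypothetical. A 5 second TTL passes comfortably, while a feed that never updates still fails.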

Luckily, solving this one is easy. Instead of having a dedicated QA team work on creating manual and automated test cases that run in a fictitious QA environment, reassign that team to work on continuous testing in production. Instead of being a gate for deploys, a QA team could continuously verify that production is working as expected. QA teams are also well situated to lead Chaos Engineering initiatives, where faults are intentionally injected in production. QA engineers could also work on making the CI/CD pipeline more reliable, so that deploys are no longer a nightmare.

6. Boring Technology is Great

With thanks to Dan McKinley, always strive for boring tech when possible. Systems are inherently unpredictable, and you want a wide area of expertise to fall back on when shit goes sideways. There are also routine operations that you’ll have to do (deploys, database migrations, etc) and it’s Very Nice to have widely used and tested tooling for this stuff. I think of databases most often when I think about this belief. MySQL is a database with many, many quirks, but it is so widely used that you should still just use it most of the time.

Very few organizations have the bandwidth to debug unique problems. You don’t want unique problems, especially when performing routine operations - e.g. storing bytes on disk, choosing a new leader in a cluster, garbage collecting objects, querying time-series data, etc. Having unique problems will kill a small to medium-sized team. It will sap you of your creative energy, which is better used creating value for customers who want to pay you monies for your software. Use your innovation tokens wisely!

Using boring technology means you can lean on a large community of users. Shit on it all you want, but there are very few PHP issues that someone else hasn’t already encountered. Nowadays, the same is probably true for sufficiently widely used versions of Ruby on Rails. I often say that I like to be in the 3rd cohort of technology adoption. The 1st cohort is the bleeding edge organization. The 2nd cohort is the people who feel like they can take some risks. Let those two groups go before you, run into all the big problems, and then you can go, benefiting from all of their hard-won experience.

7. Simple Always Wins

I don’t have much to say about this, but we’re all writing YAML and JSON instead of XML and we’re all using HTTP instead of CORBA, RMI, DCOM, XPCOM, etc. Right? In that same spirit, I’d rather debug problems in a LAMP stack than a Microservices architecture any day.

Quick sidebar on Microservices: as with so many trends in tech, they are often sold as a panacea. Let me be clear: Microservices, designed well, solve some specific problems and as with most solutions to complex problems, involve several trade-offs. If you are going in this direction, I do have opinions on how you should do it, but I also think you should hold off for as long as you can.

8. Non-Production Environments Have Diminishing Returns

A more direct heading for this section would be “Non-Production Environments are Bullshit”. Environments like staging or pre-prod are a fucking lie. When you’re starting, they make a little sense, but as you grow, changes happen more frequently and you experience drift. Also, by definition, your non-prod environments aren’t getting real traffic, which makes them fundamentally different. The amount of effort required to maintain non-prod environments grows very quickly. You’ll never prioritize work on non-prod like you will on prod, because customers don’t directly touch non-prod. Eventually, you’ll be scrambling to keep this popsicle-sticks-and-duct-tape environment up and running so you can test changes in it, lying to yourself, pretending it bears any resemblance to production.

9. Things Will Always Break

It’s impossible, even undesirable, to avoid failure. Lean into the fact that failure is inevitable, and focus on how you respond to it. This means investing in a continuously improving incident response process. There’s no one-size-fits-all for every company and team, but you should have a good idea of what to do when things go wrong, and you should have mechanisms in place to learn from those situations and improve your processes. Invest in Incident Analysis. It’s a huge field with lots of valuable tools and resources for maximizing the return on investment when incidents occur (or don’t!).

This is an area where Chaos Engineering can be helpful. Injecting failures into production can improve confidence in how to respond when a system starts behaving in unexpected ways. Game Days can be a particularly effective way to allow a team of engineers to practice various outage scenarios.
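At its smallest, fault injection is just a wrapper that makes a call fail some fraction of the time. The Python sketch below is illustrative only: `with_fault_injection` and `InjectedFault` are made-up names, and real Chaos Engineering tooling adds safeguards (blast radius limits, kill switches) that this leaves out.

```python
import random


class InjectedFault(Exception):
    """Raised in place of the real call when a fault is injected."""


def with_fault_injection(func, failure_rate, rng=random.random):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault.

    rng is any zero-argument callable returning a float in [0, 1); it is a
    parameter so that behavior can be made deterministic in tests.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    name = getattr(func, "__name__", "call")

    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency call this way during a Game Day, for example `with_fault_injection(fetch_profile, 0.1)` with a hypothetical `fetch_profile`, lets a team watch how alerts, retries and fallbacks behave when one call in ten fails.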

Conclusion

A lot of the beliefs outlined in this post are at least counter-intuitive, if not somewhat controversial, but I’m nevertheless convinced that they’re true. That doesn’t mean my mind cannot be changed, but it is unlikely. If you strongly agree or disagree, I’m on the internets. I’d be very curious to hear about your experiences.