You're overcomplicating production

18 Oct, 2024

You're going to have outages in production. They're inevitable. The question is how to best minimize outages, both their frequency and duration.

Common wisdom advocates for using managed k8s and databases, containerized services, horizontal scaling for redundancy, building images in CI/CD, and defining your infrastructure as code. This is Best Practice™, and no one was ever fired for following it. In fact it's quite good for your résumé.

I will argue that such common wisdom is wrong. These tools are complex, and complex infrastructure is inherently risky and a wildly net-negative distraction from your business.

Over the past 10 years I've run services in production with millions of customers, exceeding four nines (99.99%) of uptime. During that time I experimented with a wide variety of architectures, starting with self-managed servers on a fleet of OpenBSD VMs, then writing a custom orchestrator that deployed services as OCI containers, and finally adopting GKE-managed k8s.

Over time I increased the complexity of the system, looking to how the industry solved real problems I was facing, such as:

If my server goes down, how can I have redundancy?
If my VMs get deleted, how can I recover quickly?
If I need to hire more people, how can I get them up-to-speed?

But like Goldilocks looking for porridge, nothing was quite right.

Deploys take 15 minutes on GitHub workers. When something goes wrong, it kicks off a murder mystery to figure out the problem, complete with root cause analysis and post-mortems. It takes an entire team to manage the system, requiring complex network diagrams and human processes to keep everything up-to-date.

When I started up a new project for the first time in 10 years, I realized just how simple it all could have been. Even with millions of customers and 40 employees, we could have easily run on a single VM using Go and SQLite. We could have 10x'd with that same strategy.

There's a whole industry pushing complexity. That's why GCP/AWS/Microsoft/HashiCorp and every VC-backed company under the sun sponsor so many events -- they need to convince you that you need what they sell. They market it like any other product. Once it takes hold in the industry as Best Practice™, it self-perpetuates, with developers advocating for the latest hype without fully understanding the trade-offs, until it's years later and you're stuck managing MongoDB.

Fuck that. I'm swimming against the current. Build the simplest systems possible.

Simple systems are faster to iterate on, easier to debug, and just as secure and reliable, if not more so.

As I write future blog posts, I'll be covering this in more detail:

Picking a VM provider
Managing a Linux server
Scripts to make deploys easy
How to minimize outages
How to secure everything

Subscribe on RSS to follow along.

#devops #sysadmin