A Healthy Platform Checklist


I’ve been doing a lot of thinking and talking with a lot of folks lately about what I’ve been calling “devops” and now mostly call platform engineering: the care and feeding of the underlying systems that power scalability, fault tolerance, and developer productivity. There’s a great deal of value to be mined from treating your infrastructure as a product, with strongly guaranteed interfaces and APIs for other product teams to use. It pays to start early on the kind of infrastructural work that empowers early developers and enables more effective early development, while building towards the long-term scalability needs of a startup heading for the “elbow” of exponential growth, where you Get Noticed and life gets interesting.

Much of this doc is cadged from everybody else’s notebooks about platform engineering, picking and choosing best practices from others and from my own experience to put together what I’m calling a Healthy Platform Checklist. Unlike the Joel Test, however, I don’t intend this to apply to everybody everywhere; this is a set of characteristics for shops that expect the need for scalability and that value developer agility and productivity, and not all businesses need that. I emphatically don’t view this as a purely prescriptive “you suck if you aren’t doing this right now,” but instead as something to pin on the wall: a set of guiding principles to keep in mind and ideal states to work towards.

  • I: Systems are provisioned via code.
  • II: Engineers with production responsibilities are embedded with feature teams.
  • III: Production applications are twelve-factor apps.
  • IV: Monitoring and alert configuration is centralized and easily accessible to the people who need it.
  • V: Systems are self-healing without human involvement.
  • VI: Everyday administration tasks are handled through automated systems.
  • VII: Configuration data is stored in an auditable central repository and applications reflect the current state of configuration.
  • VIII: The developer environment mirrors the production environment.
  • IX: Applications are presented with a fresh, consistent runtime regardless of deployment environment.
  • X: Any system that must be deployed more often than monthly does so faster than you can get a cup of coffee.
  • XI: The developer environment is the platform team’s production environment.
  • XII: Platform teams are empowered to consider future needs as well as those of the present.

(If you work somewhere that scores an eight or above, I’d love to talk to you and pick your brain. This is hard stuff and anybody who’s that far along is worth learning from. Get in touch?)

A Healthy Platform Checklist

I: Systems are provisioned via code.

If I had to pick just one thing out of this entire list that I’d need above all else, it’s this one. Whether it’s Chef or Puppet or Ansible or bash scripts, every developer in your shop had better be worth more than the time it takes them to stand up new hardware by hand. Couple that with the inherent systems documentation you get from reading Chef cookbooks or Ansible playbooks and you’ve got the foundational building block for everything else you do. You just shouldn’t have time to do otherwise.
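
To make that concrete, here’s a minimal sketch of what “provisioned via code” can look like at the lowest level, written against EC2 with boto3. The AMI ID, instance type, and tags are placeholders, and in practice you’d express this through Chef, Ansible, or similar rather than raw API calls:

```python
# Minimal sketch: stand up an app server from code instead of by hand.
# The AMI ID, instance type, and tag values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # pre-baked image (placeholder)
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "app-server"}],
    }],
)
print("launched:", response["Instances"][0]["InstanceId"])
```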

II: Engineers with production responsibilities are embedded with feature teams.

This is my nod to the concept of devops. If your platform folks are busy shipping the platform, they don’t have time to ship everybody else’s stuff, too. Release engineering is a Thing in some parts of town, but don’t you think that ends up being the same kind of throw-it-over-the-fence silo that everybody’s already unhappy with? C’mon.

You hire smart people everywhere, so give some of those smart people a modicum of responsibility around managing their production efforts. Give the responsibility for maintenance to the people building the things you want to deploy. The X team, not your platform team, is home to your experts on X, so make the tools for performing production tasks available to them and ensure the platform team is seen as a resource for when a problem is beyond the scale of the app and its related code.

III: Production applications are twelve-factor apps.

Some of my early readers have raised their eyebrows at this one. Many of the factors in Heroku’s awesome doc relate to parts of the application with which the platform team is not directly concerned, so why are we talking about them here?

I view the behaviors of the platform as interfaces against which your product devs code. In this light, the platform making these expectations clear is no different from any other consumable interface within the environment. If your platform expects apps to be disposable, that’s functionally identical to Rails expecting configuration files to be in YAML. The platform sets the rules of the road, and twelve-factor apps seem to be the best option yet for scalable, high-velocity systems. Encoding those expectations into your platform helps keep everybody on track. (Plus, big chunks of the twelve-factor app speak directly to infrastructure and are generally just really good ideas.)
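
As a sketch of what those encoded expectations look like from the app’s side (this isn’t any particular platform’s contract), here are two twelve-factor behaviors a platform can lean on: config comes from the environment, and shutdown is routine:

```python
# Sketch of two twelve-factor behaviors the platform can rely on:
# configuration comes from the environment, and shutdown is graceful and quick.
import os
import signal
import sys
import time

DATABASE_URL = os.environ["DATABASE_URL"]  # fail fast if the platform didn't provide it

def handle_sigterm(signum, frame):
    # Disposability: finish up and exit promptly when the platform says so.
    print("received SIGTERM, shutting down cleanly")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    # ... serve requests against DATABASE_URL ...
    time.sleep(1)
```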

IV: Monitoring and alert configuration is centralized and easily accessible to the people who need it.

Centralized logging and monitoring of system statistics have two major pluses: developer velocity and system self-documentation. SSH may be necessary in the odd edge case, but if somebody has to log into a machine just to read their logs, they’re going to be slower when they do it. Self-documentation comes into play with the alerts and alarms that accrete over time within the monitoring infrastructure. An alarm that fires when a given application throws errors on 2% of all requests means something, and having that living record of Things That Have Needed Monitoring can help inform future decisions.
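
That 2% alarm is worth expressing as code rather than tribal knowledge. A rough sketch, where get_counter and alert stand in for whatever your monitoring stack actually exposes:

```python
# Sketch: the "2% of requests are erroring" alarm as code.
# get_counter and alert are hypothetical hooks into your monitoring stack.
ERROR_RATE_THRESHOLD = 0.02

def check_error_rate(get_counter, alert):
    requests = get_counter("app.requests.count")
    errors = get_counter("app.requests.errors")
    if requests == 0:
        return
    rate = errors / requests
    if rate >= ERROR_RATE_THRESHOLD:
        alert(f"error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
```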

V: Systems are self-healing without human involvement.

Show of hands: who’s good with getting smacked by PagerDuty alerts at 2AM?

My hand may be the only one up, and that’s because I haven’t kept a normal sleep schedule since high school.

In a sane environment, the steps to recover from failure should be well-documented and extremely digital—you should be able to document them as a bunch of if-thens. I spoke to a company that contracts with a third party for their level one support, giving them a runbook against which to work and escalating problems that they couldn’t solve. I’m not a computer scientist, but I’m pretty sure a healthy environment already has lots of if-then processors lying around. They’re called computers. You can rent one that can complete all the if-thens that most failures could ever need for two cents an hour in AWS’s us-east-1. It’s cheaper than having a meat processor doing the moral equivalent (and then calling you when he or she gets off-book) and more reliable.

An issue that needs to be escalated out-of-hours to a developer should require a postmortem. Turn the results into Ruby and have it send you an email telling you what it did, not what you need to do. Then save firing off your devs’ pagers for when something is really wrong.
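
Here’s what “turn the runbook into code” can look like, as a sketch; the health URL, service name, and restart step are placeholders for whatever your own runbook’s if-thens say:

```python
# Sketch: a runbook step as an if-then the machine runs instead of a human.
# The health check URL, service name, and notification are placeholders.
import subprocess
import urllib.request

def is_healthy(url="http://localhost:8080/health"):
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def heal():
    if is_healthy():
        return
    # The documented recovery step: bounce the service.
    subprocess.run(["systemctl", "restart", "myapp"], check=True)
    # Tell a human what was done, not what they need to do.
    print("myapp was unhealthy; restarted it automatically")
```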

VI: Everyday administration tasks are handled through automated systems.

I’ve heard this item expressed in an even more aggressive manner: no developer should need SSH access to a production system. I think that’s overkill. No matter how much monitoring you stack up, there are some issues that require getting in there and perturbing electrons directly.

But everyday tasks? See rule V. You’ve got if-then machines all over the place. Use them.
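
For example, a routine cleanup job like this sketch (the path and retention window are made up) keeps people out of SSH for the boring stuff:

```python
# Sketch: an everyday cleanup task as a scheduled job instead of an SSH session.
# The releases path and retention window are placeholders.
import os
import shutil
import time

RELEASES_DIR = "/var/lib/myapp/releases"
MAX_AGE_DAYS = 14

def prune_old_releases():
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(RELEASES_DIR):
        path = os.path.join(RELEASES_DIR, name)
        if os.path.getmtime(path) < cutoff:
            print("pruning", path)
            shutil.rmtree(path, ignore_errors=True)
```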

VII: Configuration data is stored in an auditable central repository and applications reflect the current state of configuration.

Heroku’s twelve-factor app discusses placing configuration variables in the environment, and I’m down with that. But how do you get the configuration to your application? That’s a little tougher, and it becomes tougher still when you consider modern cloud environments and the latency/consistency tradeoff. I’ve seen S3 used as a config store; the autodeploy tooling we built at Localytics used it for exactly that. But given my druthers I’d rather have a system that’s strongly consistent (I’m torn between Zookeeper and Consul, but the others have their fans) and that can help me push configuration to my clients rather than just have it pulled down.

Application behavior in response to configuration changes is part and parcel of this rule, and depending on your environment that can be real easy or real hard. You shouldn’t have to go manually bounce your servers (rule VI!) to pick up configuration changes, but if your applications are sufficiently disposable, then a kill and restart to pick up new configuration should be just fine. Or you could get fancy with something like Archaius and feel smarter than, like, everybody (including me, that stuff’s hard).
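
A rough sketch of the pull-and-bounce approach against Consul’s KV HTTP endpoint; the key name and poll interval are placeholders, and a push-based or Archaius-style setup is the fancier version of the same idea:

```python
# Sketch: poll a central config store (Consul's KV API here) and bounce the app
# when configuration changes. The key name and poll interval are placeholders.
import time
import urllib.request

CONSUL_URL = "http://localhost:8500/v1/kv/myapp/config?raw"

def fetch_config():
    with urllib.request.urlopen(CONSUL_URL, timeout=5) as resp:
        return resp.read()

def watch_and_bounce(restart):
    current = fetch_config()
    while True:
        time.sleep(30)
        latest = fetch_config()
        if latest != current:
            current = latest
            restart()  # disposable apps: a kill-and-restart picks up the new config
```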

VIII: The developer environment mirrors the production environment.

This one shouldn’t be controversial and I won’t hammer my keyboard too much about it. Impedance mismatches between dev and prod invariably lead to somebody poking the wrong button and then oh hey, PagerDuty is going nuts.

IX: Applications are presented with a fresh, consistent runtime regardless of deployment environment.

Putting myself in the metaphorical shoes of a new deploy of an app at most places I’ve worked is kind of terrifying. You get plopped down on a machine that’s been running for a year-plus and is missing who-knows-how-many patches that dev and test already have. Maybe there were files on disk being used as scratch storage. Maybe they’re even important files that the new deployment of the app has to go take care of, because if they disappear from disk the data’s lost forever. This is silly. Give every application a new sandbox and shed no tears when it gets blown away. Not only is that the road to easy horizontal scaling, it’s one less thing for your developers to think about.

I’m a fan of Docker for dealing with this. With Docker you present the application a guaranteed-consistent environment. With ENTRYPOINT, you can define a wrapper script that supports the configuration data of rule VII and wires up whatever services will be needed, and the application, once started, can just run, blissfully ignorant of whatever mess was in place for the last deploy. What’s also nice is that Docker erases a lot of the dev/prod divide from the perspective of the app. Since only volumes and environment variables are exposed within the container, it’s easy for a developer to use boot2docker and an environment-specific script to mock everything the app will need, starting it up in a couple of seconds.
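
To keep these sketches in one language, here’s the shape of that ENTRYPOINT wrapper written as a Python script rather than shell; the env var names, config URL, and app command are all placeholders:

```python
#!/usr/bin/env python3
# Sketch of an ENTRYPOINT wrapper: pull platform-provided configuration, put it
# where the app expects it, then exec the app so every start is a clean one.
# CONFIG_URL, APP_CONFIG, and the app command are placeholders.
import os
import urllib.request

config_url = os.environ.get("CONFIG_URL")
if config_url:
    # Wire up rule VII: fetch config from the central store and expose it to the app.
    with urllib.request.urlopen(config_url, timeout=5) as resp:
        os.environ["APP_CONFIG"] = resp.read().decode("utf-8")

# Replace this wrapper process with the application itself.
os.execvp("myapp-server", ["myapp-server"])
```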

X: Any system that must be deployed more often than monthly does so faster than you can get a cup of coffee.

Please note: this isn’t an excuse for really slow coffee machines.

Fast deploys are critical both for developer productivity and for production safety. Everybody knows that distractions cause many developers to lose their flow (I’m fond of saying that a five-minute interruption is really a fifteen-minute interruption), and causing a five-minute interruption just to see if your code works is a self-evidently bad idea. And while rolling deploys and the like in production environments are just good practice, having the agility to go “hey, we can have this fix out there right now” is a valuable safety net.

Docker’s real nice for addressing this concern when coupled with a multi-tenant clustering solution like Mesos. If you’re in AWS, instances take minutes to get their act together even when you use pre-baked images. When a deploy is only as long as it takes to pull a package from S3, that turns into seconds.
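
Sketched out, the whole deploy is little more than this (the image and container names are placeholders, and in practice a scheduler does the driving):

```python
# Sketch: a deploy that's just "pull the new image, swap the container".
# The image and container names are placeholders; a scheduler normally does this.
import subprocess

IMAGE = "registry.example.com/myapp:latest"

def deploy():
    subprocess.run(["docker", "pull", IMAGE], check=True)
    subprocess.run(["docker", "rm", "-f", "myapp"], check=False)
    subprocess.run(["docker", "run", "-d", "--name", "myapp", IMAGE], check=True)
```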

XI: The developer environment is the platform team’s production environment.

It’s not just production that a platform team needs to be careful about. Other developers’ productivity depends on the reliability of their work environments. Causing disruptions, especially unexpected ones, in a developer’s flow is a real good way to make them distrust the product being put out by the platform team.

So don’t break everybody else’s sandbox. Have your own. Break that.

XII: Platform teams are empowered to consider future needs as well as those of the present.

There’s some stuff to unpack here. But this is the most critical bit.

A couple early readers of this checklist raised a very valid concern: at the earliest stages, a startup can’t afford the manpower for dealing with these challenges. And that’s totally true. The proverbial two guys in a garage aren’t in a place to care about fast deployment because it’s just as likely that their deployment pipeline is Capistrano to a Linode somewhere. And that is completely okay. At that level, investing in tools is obviously not even on the radar.

But that can’t last, right? You start to hire more developers, you start to divvy up responsibilities at a general level. Changes to your products start happening without you as a founder being a direct part of them. People who aren’t you are doing things that your customers see. And if the environment they’re all working in is the same one you and your co-founder had when you were sitting next to each other in Mom’s garage, you’re going to have chaos. Controlling the chaos is the difference between lurching side to side and moving straight towards your goals.

That first big jump, from one or two developers to five or six, is the optimal place to begin development of your platform. And while “platform” sounds imposing, the surfeit of awesome open source projects that play nicely together means there’s never been a better time to start than right now. There’s a compounding factor to platform engineering where the earliest benefits you accrue (say, getting dev and prod to parity) immediately start paying off not just for product devs, but for the further development of the platform itself. Kicking off that virtuous cycle is an investment, but I contend that that early investment comes at a deep enough discount that it’s worth consideration early in your company’s life. It remains an ongoing process throughout the life of your shop, but starting early enables you to keep on top of it.

The story’s a little different if you haven’t kept on top of it, if infrastructural concerns are kicked down the road past that growth elbow. The rolling boulder of a high-growth startup is moving at such high speed and there’s so much momentum caught up in it that trying to turn the boulder is as likely to get you squashed as to affect the boulder any. The investment in better process has to be done more carefully in such an environment, for sure. It’s a slower process, with the need for building up more and more infrastructure before you can feel safe turning over your livelihood, the customer-facing bits of your business, to it. And just over the last couple weeks I’ve talked to half a dozen shops that have left that work unfinished because it feels like too much is being spent in establishing a development platform, when the sneaking reality is the opposite: it feels like you’re spending too much because you weren’t spending enough in the first place.

You hire good people. So trust them to fight the fires that need fighting and build towards the future in the way that they, in conjunction with the rest of the good people you’ve hired throughout your dev group, know best.

Comments?

I’m interested in what you think about this stuff, what your experiences have been both consuming and building out these platforms. This checklist is totally a work in progress, but I feel like this is a start. Feel free to follow me on Twitter or send me an email; I dig talking about this stuff and I want to hear from you.

