So in our adventures at Leaf, building out our new environment, Will and I ran into a persistent problem: EC2 doesn’t guarantee that an instance will ever be shut down gracefully. No guarantees for upstart scripts, /etc/init.d, whatever. This is particularly problematic for dealing with Chef, where an instance needs to be deleted from the Chef datastore when it goes away. If it isn’t, knife search will happily return tens or hundreds of nodes that no longer exist. Which is great.
No, wait. It’s crap.
Enter asger. Since autoscaling groups can publish notifications to SNS, asger will watch an SQS queue subscribed to SNS and execute arbitrary tasks based on the up/down pattern. Created nodes invoke up functions, terminated nodes invoke down functions, life is good. Currently there’s a task for deregistering Chef nodes, but asger is pretty flexible–Route 53 subscriptions, tie-ins to systems like Consul, that sort of thing. You can get asger via RubyGems with gem install asger or on GitHub; use it in good health.
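To make the up/down idea concrete, here’s a minimal sketch of the dispatch step in Python. This is not asger’s actual code (asger is a Ruby gem); it only illustrates the pattern. It assumes the SQS message body is the standard SNS JSON envelope (raw message delivery turned off) wrapping an Auto Scaling notification, and the `LoggingTask` class and the `dispatch` and `parse_notification` names are made up for illustration. A real task would do something like deregister the Chef node in its `down` hook.

```python
import json

# Event names used by EC2 Auto Scaling notifications.
LAUNCH = "autoscaling:EC2_INSTANCE_LAUNCH"
TERMINATE = "autoscaling:EC2_INSTANCE_TERMINATE"


def parse_notification(sqs_body):
    """Unwrap the SNS envelope found in an SQS message body.

    Returns (event, instance_id), or None when the body is not an
    instance-level notification (for example, a test notification).
    Assumes raw message delivery is disabled on the subscription.
    """
    envelope = json.loads(sqs_body)
    message = json.loads(envelope["Message"])
    event = message.get("Event")
    instance_id = message.get("EC2InstanceId")
    if event is None or instance_id is None:
        return None
    return event, instance_id


def dispatch(sqs_body, tasks):
    """Call up() or down() on every task; return the direction taken."""
    parsed = parse_notification(sqs_body)
    if parsed is None:
        return None
    event, instance_id = parsed
    if event == LAUNCH:
        direction = "up"
    elif event == TERMINATE:
        direction = "down"
    else:
        # Failed launches/terminations and anything else are ignored here.
        return None
    for task in tasks:
        getattr(task, direction)(instance_id)
    return direction


class LoggingTask:
    """Hypothetical task that just records what it was asked to do."""

    def __init__(self):
        self.seen = []

    def up(self, instance_id):
        self.seen.append(("up", instance_id))

    def down(self, instance_id):
        self.seen.append(("down", instance_id))
```

In a real deployment, something would long-poll the queue, hand each message body to `dispatch`, and delete the message once every task has run.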
I’ve been doing a lot of thinking and talking with a lot of folks lately about what I used to call “devops” and now mostly call platform engineering: the care and feeding of the underlying systems that power scalability, fault tolerance, and developer productivity. There’s a great deal of value to be mined from treating your infrastructure as a product, with strongly guaranteed interfaces and APIs for other product teams to use. That means starting early on the kind of infrastructural work that can empower early developers and enable more effective early development, while building towards the long-term scalability needs of startups heading towards the “elbow” of exponential growth, where you Get Noticed and your lives get interesting.
Much of this doc is cadged from everybody else’s notebooks about platform engineering, picking and choosing best practices from others and from my own experience, to put together what I’m calling a Healthy Platform Checklist. Unlike the Joel Test, however, this isn’t meant to apply to everybody everywhere; it’s a set of characteristics that assume a need for scalability and that value developer agility and productivity, and not all businesses need that. I emphatically don’t view this as a purely prescriptive “you suck if you aren’t doing this right now”, but instead as something to pin on the wall, a set of guiding principles to keep in mind and ideal states to work towards.
So, part one of this little getting-acquainted with Mesos left me with one (1) Mesos server, running apps in an app-ish sort of way via the Marathon framework. Which is cool. But there are a lot of places to go from here.
I could install additional frameworks onto the box, if I wanted–I could bolt in Chronos to give me a cron replacement or I could wire up Spark to map and reduce things on HDFS. But those aren’t really interesting to me, not least because I don’t have any jobs that have a burning need to run at 3AM, nor do I have a few terabytes of interesting data lying around to gnaw on. So instead I’m going to stick with Mesos and see what I can do about expanding my little cluster past a singleton server.
So, blog reboot, like, thirty-four. I’m bad at blogs. But this time, I come bearing neat stuff to talk about, so maybe this’ll stick.
Anyway, since starting at Localytics, I’ve found myself thrown into a bunch of new-ish stuff. My prior ops experience was much more “developer moonlighting as a sysadmin”, rather than buzzword-compliant DevOps Ninja Powers. At Localytics I’ve been leveling those up pretty quick, though, and there’s some fun stuff I’d like to talk about a little. But we need to figure out what we want to make public first, and it’ll probably end up on the company blog before it ends up here, so I’m going to natter on a bit about something I’m doing in my spare time: setting up a Mesos cluster and populating it with apps for a side project or two.