Free public beta: Machine Learning + incident triage = 😊Ops

At Overseer Labs, we kind of hate that doing DevOps often involves spending a lot of time in stressful situations, including worrying about what the heck might be going on in your systems that you don’t know about, scrambling to resolve incidents when there’s a lot on the line, and getting drowned in an onslaught of potentially important but potentially totally irrelevant alerts. We surveyed a bunch of people doing DevOps, and it turns out that many of you hate these things, too!

The cool thing about being engineers is that we can build things to eradicate problems we hate, so Overseer has been working to build something that can help ease some of the DevOps pain. Today, we’re finally ready to let the general DevOps-ing public try it! What exactly are we letting people try? Time for a little show-and-tell 😊.

The basic idea is to help you triage during an incident, or quickly decide whether an alert is something concerning or a non-issue. We do this by taking all of your existing metrics and summarizing them as a series of overall health scores. Let’s say you get an alert from PagerDuty, or you notice on one of your CloudWatch dashboards that a bunch of your users are getting 500s. With Overseer, you can look at summary health scores for groups of your metrics to see which portions of your system might be contributing:

At Overseer, our infrastructure is simple enough that we prefer to break down our metrics by EC2 instance, and we even keep some groups for subsections of an instance (e.g., stage_network is a group for metrics relating to our staging machine’s network activity). If your infrastructure is more complex, you might prefer to summarize metrics at a higher level (e.g., a summary score for an Auto Scaling Group or load balancer). Think of these groupings like a dashboard summary: if you have metrics that you usually look at on one dashboard, they’re probably well-suited to be grouped together into a single Overseer score. If you use CloudWatch, Overseer will suggest these metric groupings for you, so you won’t need to spend any head-scratching time organizing your metrics.
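To make the grouping idea a bit more concrete, here’s a rough Python sketch of bucketing metrics by a shared dimension. To be clear, this is not Overseer’s actual suggestion logic (which this post doesn’t describe). The metric records, the `suggest_groups` helper, and the dimension priority order are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical metric descriptors, shaped loosely like CloudWatch metrics:
# a namespace, a metric name, and a dict of dimensions.
METRICS = [
    {"namespace": "AWS/EC2", "name": "CPUUtilization",
     "dimensions": {"InstanceId": "i-prod"}},
    {"namespace": "AWS/EC2", "name": "NetworkIn",
     "dimensions": {"InstanceId": "i-prod"}},
    {"namespace": "AWS/EC2", "name": "CPUUtilization",
     "dimensions": {"InstanceId": "i-stage"}},
    {"namespace": "AWS/ELB", "name": "HTTPCode_Backend_5XX",
     "dimensions": {"LoadBalancerName": "web"}},
]


def suggest_groups(metrics,
                   group_by=("InstanceId", "AutoScalingGroupName", "LoadBalancerName")):
    """Bucket metric names by the first dimension in `group_by` they carry.

    The order of `group_by` is an assumed priority: put a higher-level
    dimension first if you would rather get one group per Auto Scaling Group.
    """
    groups = defaultdict(list)
    for metric in metrics:
        for dim in group_by:
            if dim in metric["dimensions"]:
                key = f"{dim}={metric['dimensions'][dim]}"
                break
        else:
            # No known dimension: leave it for a human to place.
            key = "ungrouped"
        groups[key].append(metric["name"])
    return dict(groups)


groups = suggest_groups(METRICS)
for key, names in groups.items():
    print(key, names)
```

Each resulting bucket is a candidate for a single summary score, much like the per-instance groups described above.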

Once you see some metric group that’s behaving anomalously, you can click on a point to see what interesting things were going on in the metric group at that time. For example, clicking on one of the points near the peak of our prod_box score above takes me to this view:

Right away I can see why Overseer thought our prod machine was behaving strangely: we started using CPU credits, freed up some memory, and (as I can see by scrolling down) freed up some disk space at the same time. These key metrics floated to the top, while other metrics that weren’t behaving strangely (like the swap space used) sank to the bottom of the page. If this had been a true incident (think some rogue batch process rapidly gobbling up resource locks and causing other processes to idle), I’d now have some of the critical clues without having to dig through umpteen-zillion dashboards. If you want a more in-depth real-world example of Overseer helping with incident triage, I suggest you take a look at some of our case studies.
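If you’re curious what “floating the strange metrics to the top” could look like in code, here’s a minimal sketch that ranks the metrics in a group by how far each one sits from its own recent history. It uses a plain z-score as a stand-in, which is an assumption on my part for illustration and not a description of the models Overseer actually runs. The metric names, sample values, and the `anomaly_score`, `rank_metrics`, and `group_score` helpers are all hypothetical.

```python
from statistics import mean, stdev


def anomaly_score(history, current):
    """Absolute z-score of `current` relative to `history`."""
    if len(history) < 2:
        return 0.0  # not enough data to judge
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat metric that suddenly moves is maximally surprising.
        return 0.0 if current == history[0] else float("inf")
    return abs(current - mean(history)) / spread


def rank_metrics(group, at_values):
    """Return (metric_name, score) pairs, most anomalous first."""
    scored = [(name, anomaly_score(history, at_values[name]))
              for name, history in group.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def group_score(ranked):
    """Summarize a group by its single most anomalous metric."""
    return max((score for _, score in ranked), default=0.0)


# Hypothetical recent history for a "prod_box" group, plus the values
# observed at the point being inspected.
prod_box = {
    "CPUCreditUsage": [0.1, 0.2, 0.1, 0.2, 0.1],
    "MemoryUsedPercent": [70, 71, 69, 70, 70],
    "SwapUsed": [10, 12, 11, 10, 12],
}
now = {"CPUCreditUsage": 2.5, "MemoryUsedPercent": 55, "SwapUsed": 11}

ranked = rank_metrics(prod_box, now)
for name, score in ranked:
    print(f"{name}: {score:.1f}")
print("group score:", round(group_score(ranked), 1))
```

In this toy data, the CPU credit and memory metrics land at the top of the ranking while the unchanged swap metric drops to the bottom, which mirrors the view described above. Taking the maximum as the group score is just one simple choice; an average or a weighted sum would work as a summary too.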

If you think you’d like to try this out, you can sign up for free here. We don’t require any payment info, just access to pull some of your CloudWatch metrics and read access to the Auto Scaling and EC2 APIs (so we can understand and suggest groupings for your metrics). You should be up and running in just a few minutes. Questions/comments/feedback? That’s why we’re doing this! Drop me a line at josh@overseerlabs.io. Happy DevOps-ing!