Stack for fighting info overload: T.S. Eliot & DevOps

Josh @ Overseer
Overseer Engineering Blog
9 min read · Feb 5, 2018


“Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?”
-T.S. Eliot, Choruses from “The Rock”

Penned by renowned poet T.S. Eliot in the 1930s, these questions have some powerful relevance to the modern software world. To be fair, the focus of the overall poem was largely the emptiness of life for the “modern” worker. But taken on its own, this pair of lines suggests a truth that (while less grandiose) still has the potential to change how we think about operating complex software systems.

One slightly less poetic restatement of the above might be:

“More information does not imply (and can even be detrimental to) more useful knowledge. Similarly, increased knowledge does not imply (and can be detrimental to) more useful wisdom”
-A soul-less paraphrase of T.S. Eliot

Or even (if you’re not a fan of the whole strict wisdom > knowledge > information hierarchy):

“Data comes in various forms, of varying degrees of usefulness. More data doesn’t necessarily mean an increase in what you can use, and more data can even decrease how much useful data you have”
-An engineer’s paraphrase of T.S. Eliot

Though the poet may be rolling over in his grave at this point (sorry Tom!), hopefully your software brain has started churning by now. If you’ve ever been on call, that brain churning may even bring up some painful memories. Just because you’re monitoring your system in detail doesn’t mean you can quickly find those key facts that will help you determine exactly what went wrong and how.

People in the DevOps world have been aware of this problem for some time, and have tried various approaches to solve or mitigate it. Roughly, the approaches I’m familiar with (let me know in the comments if you know more!) fall into three categories:

(1) Only monitor a small number of key metrics (aka key performance indicators, or “KPIs”), so that all the data you have is relevant data.

(2) Monitor all the things! But have a powerful search/query mechanism that lets you explore the data quickly, so you can find what’s relevant when you know you need to.

(3) Monitor all you can, but use machine assistance to help suggest what might be relevant.

Of course these strategies can be mixed and matched to varying degrees, but to keep things clean, let’s visit them each in isolation to see what their strengths/weaknesses are, then we’ll discuss how they can work together at the end.

(1) King KPIs

One of the main strategies for combating information overload is relatively obvious: ignore all but a few carefully selected pieces of information. Choosing these crucial KPIs requires some careful contemplation. Luckily, there’s no shortage of suggestions out there to help you select your KPIs. Whether you use the Four Golden Signals handed down from on high by Google (latency, traffic, errors & saturation), or you prefer USE and RED (Utilization, Saturation, Errors; and request Rate, request Errors, request Duration, respectively), the main idea is to determine which metrics best capture the high-level health of your system and make those key performance indicators the focus of your monitoring.

“Making those KPIs the focus of your monitoring” can mean using these metrics as the source of your alerts or using them as your first go-to for triaging when there’s a known issue. Tying back to our “degree of data usefulness” paradigm, this approach tries to filter out useless data by specifying in advance a subset of the data that is known to always be useful.
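To make that concrete, here’s a minimal sketch, in plain Python rather than any particular monitoring product, of alerting on a single golden-signal KPI (the error rate). The request records and the 2% threshold are made-up examples.

```python
# A minimal sketch of alerting on one KPI: the error-rate golden signal.
# The Request shape and the 2% threshold are illustrative, not a real product's API.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def error_rate(window: list[Request]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not window:
        return 0.0
    return sum(1 for r in window if r.status >= 500) / len(window)

# Pretend this is the last minute of traffic: one failure out of four requests.
last_minute = [Request(12.0, 200), Request(35.5, 200), Request(410.2, 500), Request(28.1, 200)]

ERROR_RATE_THRESHOLD = 0.02  # hypothetical: page a human above 2% errors
if error_rate(last_minute) > ERROR_RATE_THRESHOLD:
    print(f"ALERT: error rate {error_rate(last_minute):.0%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
```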

The main advantages of this approach are that:

(a) The list of relevant metrics can fit in a human brain as well as on a dashboard (or a small handful of them),

(b) If used as a source of alerts, focusing on KPIs means that every alert is likely to be actionable (the end-goal for an alerting system — nobody likes alert fatigue!), and

(c) There’s a relatively small number of metrics that a newbie needs to learn to understand once the KPIs have been decided on.

However, those golden signals alone can only buy you so much. Anyone who has ever tried debugging software knows that the search for root cause rarely leads exactly where you expected in advance.

https://xkcd.com/1722/

(2) Seek and ye shall find (but only what you sought)

The sad fact that “known” (in advance) metrics alone are insufficient for incident triage motivates the next approach quite well. Flexible metric querying systems aim to help you quickly take “known unknowns” and turn them into “knowns” as you diagnose.

Here, as with KPI selection tactics, there are many options for what precise form your data querying strategy can take. Datadog, Wavefront, New Relic, Prometheus, Graphite, and many others all have their own unique take on what a timeseries metric query language should look like. Despite this, most seem to have at least three key capabilities: (a) the ability to aggregate a single timeseries along the time dimension (ex: take data at a resolution of 1 minute and combine it into new data at a resolution of 5 minutes); (b) the ability to combine multiple timeseries into one (ex: take the CPU usage from all your hosts and spit out the median value at each point in time); and (c) the ability to apply transformations to one or more timeseries to produce new timeseries (ex: take a timeseries for the total number of HTTP requests since the app started and turn that into a timeseries for the number of requests per minute).
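To illustrate, here’s a minimal sketch of those three capabilities in plain Python over made-up data; real query languages express them as built-in operators rather than hand-rolled loops.

```python
from statistics import median

# (a) Aggregate one series along time: roll 1-minute points up into 5-minute sums.
per_minute = [3, 5, 2, 7, 4, 6, 1, 0, 9, 2]  # one value per minute
per_5min = [sum(per_minute[i:i + 5]) for i in range(0, len(per_minute), 5)]

# (b) Combine multiple series into one: median CPU usage across hosts at each instant.
host_cpu = {
    "host-a": [20, 25, 90, 30],
    "host-b": [22, 24, 35, 31],
    "host-c": [19, 26, 33, 29],
}
median_cpu = [median(values) for values in zip(*host_cpu.values())]

# (c) Transform a series into a new one: turn a monotonically increasing request
# counter into requests per minute (the discrete derivative).
total_requests = [0, 120, 260, 390, 540]  # cumulative count, sampled once a minute
requests_per_minute = [b - a for a, b in zip(total_requests, total_requests[1:])]

print(per_5min, median_cpu, requests_per_minute)
# [21, 18] [20, 25, 35, 30] [120, 140, 130, 150]
```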

Different query languages may excel or struggle with one or more of these capabilities (I’m looking at you and your duct-taped-on tagging system, Graphite), but the key idea is that all of these systems seek to take you from a specific question about your data (“Huh, latencies are spiking in the Foo service for requests involving in-memory computation, I wonder how many memory page reads we’re doing per second on that node”) to an answer (“Wow, page reads per second have gone up by 100x in the last 10 minutes!”) as quickly as possible. Tying back to the “usefulness of data” idea, this approach is essentially “ignore data until you suspect it might be useful, at which point make it easily available.”

The advantages of this approach are:

(a) It rarely leaves you with unanswered questions

(b) It allows you to focus on what’s important in context (and not just in general, as KPIs do)

(c) A system expert, who can make more sense of the data than a newbie, gets access to more of it (just query for it!), while the newbie doesn’t have to be overwhelmed by that extra info (out of sight, out of mind)

Unfortunately, while a powerful query language may feel like wielding the fire of the gods, relying on it alone may leave you burned. One shortcoming of this approach is that wisdom is not unlimited: your battle-won understanding of the ins and outs of your software may have created solid intuition for what questions to ask when the **** hits the fan, but what about your friend Stephanie who just joined last month? Will she know that what the “/api/foo”, “/api/bar” and “/api/baz” endpoints all have in common is that they use a lot of memory, so checking for memory thrashing is probably a good idea when their latencies seem to be spiking more than other endpoints?

Another place where querying falls short is that it assumes you know which questions to ask (or at least which ones to ask first). Based on the symptoms you see, you may have 3 or 4 equally valid theories about what might be going on (and therefore where to look next for clues). If you choose the wrong theories to start investigating, you may find yourself quoting T.S. Eliot, yelling “All our knowledge brings us nearer to our ignorance 😱” (emoji mine).

(3) Robot sidekicks

To recap where we’ve been so far: thinking about KPIs essentially helps you get the most out of what you already know is useful (“known knowns”), and using a timeseries query language helps you leverage your wisdom/understanding of the system to find facts you think will be useful (“known unknowns”). There are still some gaps left by both of these! What about the things you haven’t thought of, but which might be the “smoking gun” if you just happened to look? These “unknown unknowns” can sometimes be very important. To quote a literary genius even older than Eliot:

“… we are impressed and even daunted by the immense Universe to be explored. ‘What we know is a point to what we do not know.’”
-Ralph Waldo Emerson, Nature

To paraphrase, sometimes the universe of things we don’t know is so substantial that what we do know can be as insignificant as an infinitesimal geometric point.

So how do we leverage what we “don’t know we don’t know” without becoming overwhelmed? The answer to this question is, surprisingly, the same as the answer to the question “how can we make the murder mystery genre better?”: add a robot sidekick.

The basic idea here is to have machine learning algorithms ingest your timeseries data, and then when it comes time to investigate in the face of an incident, have the algorithms suggest clues that might inform your investigation. Once more referencing the “hierarchy of data usefulness” mental model, this approach can be summarized as “ignore data until you need some, then have an algorithm provide some recommendations that help you get to the useful data faster.”
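As one illustration (and emphatically not a description of Overseer’s actual algorithms), here’s a minimal sketch of the simplest possible robot sidekick: score every metric by how far its behavior during the incident window deviates from its recent baseline, then surface the biggest outliers as suggestions. The metric names and numbers are invented.

```python
# A toy "suggest clues" pass: rank metrics by a rough z-score of their behavior
# during the incident window versus a recent baseline. Purely illustrative.
from statistics import mean, stdev

def anomaly_score(baseline: list[float], incident: list[float]) -> float:
    """How many baseline standard deviations the incident-window mean has moved."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return 0.0
    return abs(mean(incident) - mu) / sigma

# Made-up metrics: an hour of baseline samples, then the incident window.
metrics = {
    "foo.page_reads_per_sec":  ([10, 12, 11, 9, 10, 11], [950, 1100, 1020]),
    "foo.request_latency_p99": ([80, 85, 82, 79, 83, 81], [400, 420, 390]),
    "batch.cpu_utilization":   ([55, 60, 58, 57, 59, 56], [61, 60, 62]),
}

suggestions = sorted(metrics, key=lambda name: anomaly_score(*metrics[name]), reverse=True)
print("Worth a look, most anomalous first:", suggestions[:2])
```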

One point that merits emphasis here is that the algorithms are not diagnosing your incidents for you directly, but rather providing suggestions about things you might want to look at. As an analogy, think of the “root cause of this incident” as “a musical artist or movie that fits my tastes”, and consider how Spotify or Netflix manage their content. What they can’t do is connect you with an artist/movie that you’re guaranteed to enjoy. What they can do is leverage much larger quantities of data about you and the music/movie ecosystem than you’ve got and recommend some things that will help you find that perfect piece of content faster than you would have just using their search bar (or query engine 😊).

The advantages of this approach are:

(a) It allows leveraging data that you aren’t explicitly looking at/for

(b) It adds context for your investigation

(c) It gives “newbies” (and even pros) a richer view of the system by showing relationships and patterns in the data that they were previously unaware of, which can be leveraged in future incidents and even in day-to-day work with the software

Of course, robots can’t do everything. This approach, by itself, falls short when it comes to applying “wisdom” as a guide to what’s important. For example, sure, you know that the unusual period of 100% CPU utilization on the node that handles some background batch processing is likely less important than the recent period of 100% CPU utilization on an app server node around the same time, but the algorithms can only take that knowledge into account if you communicate it to them in some way.

Putting it all together

Clearly, none of these weapons against information overload is sufficient on its own. So how do we use them to make the most out of our monitoring? Of course there are plenty of valid strategies, but one example of a pretty effective setup is:

  • KPIs: Define some KPIs using the Four Golden Signals, USE/RED, or whatever makes sense for you. Put these on a dashboard (or a small handful of dashboards), and define your alerts on them.
  • Machine Assistance: Hook up some algorithms (such as those offered by us at Overseer!) to monitor the rest of your data. When an alert goes off, have this tool send you recommendations (or consult it manually) to see if the algorithms caught useful insights that weren’t captured in the original alert but may guide you to root cause.
  • Query engine: As needed, and armed with the additional context provided by the ML recommendations, search through the rest of your data to answer any remaining questions about root cause (a rough sketch of this whole flow follows below).
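Here’s that flow as a minimal sketch; check_kpis, recommend_metrics, and run_query are hypothetical placeholders standing in for your alerting rules, an ML-assistance tool, and your metrics query engine, not any real API.

```python
# A hypothetical end-to-end loop: KPI alert -> machine-suggested clues -> ad-hoc queries.
# All three functions are stand-ins with canned answers, just to show the shape.

def check_kpis() -> list[str]:
    """Return any KPI alerts currently firing (pretend a golden-signal alert did)."""
    return ["api.error_rate above 2%"]

def recommend_metrics(alert: str) -> list[str]:
    """Ask the ML layer which other metrics look anomalous around this alert."""
    return ["foo.page_reads_per_sec", "foo.memory_thrashing"]

def run_query(expr: str) -> float:
    """Run an ad-hoc query against the metrics store for the 'last mile'."""
    return 1012.0  # canned answer for the sketch

for alert in check_kpis():                        # 1. actionable alert on a KPI
    for metric in recommend_metrics(alert):       # 2. machine-suggested context
        value = run_query(f"avg({metric})[10m]")  # 3. targeted query toward root cause
        print(f"{alert}: {metric} averaged {value} over the last 10 minutes")
```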

This setup is designed to play to the strengths of each of the above approaches. It:

  • Makes sure every alert is actionable by explicitly defining the alerts to be for the things you need to care about
  • Gives you context that results from a full view of your system without requiring you to manually look through tons of dashboards to get a complete picture of what’s behaving strangely
  • Allows you to leverage your understanding of your system in conjunction with the query system to go “the last mile” to root cause

Armed with such a setup, hopefully you can keep your system running smoothly without getting overwhelmed, and save yourself from uttering yet another Eliot quote about where your days have gone: “Where is the Life we have lost in living?”

Missing the machine assistance piece of the monitoring puzzle? We can help with that! Sign up for a free trial or shoot me a line at josh@overseerlabs.io. Want more people to see this article? Claps => more views.
