How to use machine learning to debug outages 10x faster
In this series of posts I will talk about how we’re leveraging machine learning algorithms at Overseer Labs to help our customers perform faster root cause analysis. We’ve worked with Fortune 500 companies and demonstrated significant value.
In this first post I’d like to share a case study that we performed with one of our customers — Rainforest QA.
It all started when we realized that companies care a lot about uptime and reliability. Cloud providers like AWS address part of this problem, but bad code still gets shipped, and sometimes unexpected failures happen.
So if downtime is inevitable, the best we can do is address the problem as fast as possible, learn from the mistake, and try to prevent it from happening again.
To facilitate the debugging process, companies may expend a lot of energy instrumenting their systems with logs, metrics, exceptions, and other pieces of data that could be useful. They may use different monitoring products for different parts of their tech stack, with each tool producing its own set of metrics. In total, the number of metrics collected can range from tens of thousands to billions every minute.
While this is great, it can lead to data overload during an incident. SRE teams currently need to dig through all of this data manually, sifting through an endless number of charts on a dashboard, which is grueling and painful. But thinking about it abstractly, what they’re really doing is correlating data across different sources, hoping to gain insights and uncover clues that will help them root-cause the problem.
Our theory is that with machine learning, we can automate some of this data correlation and assist the operators in finding those insights and clues. We do not claim to perform automated root cause. Instead, we want to guide the operators and help them get to the answer faster (i.e. machine assistance).
And if we’re successful, we’d be able to save companies time, money, and customer annoyance.
So now that we’ve decided on the problem we want to solve, next we’ll have to formulate that into something that we can build.
The current machine learning approaches to this problem consist of alerting on anomalies detected in univariate time series. While this is useful, individual metrics can be very noisy and thus trigger many false positives. Furthermore, it’s not uncommon for individual metrics to show plenty of anomalies even when the system is healthy, so alerting on them can lead to alert fatigue. And in the face of an incident, this approach may even confuse the operator by leading them down the wrong path.
Before going any further, I want to introduce the concept of system health. The premise is that companies collect many metrics that reflect the state of the overall system. An individual metric may reflect the state of one aspect of the system, but by looking at all metrics as a unit, we can interpret that as the state of the entire system. While we may care about the health of one aspect of the system, we generally won’t care until the health of the entire system is degrading.
Thus, anomalies in a given metric may indicate the health degradation of one aspect of the system, but might reflect a system state that we don’t need to worry about. However, if there was a way to perform analysis across all the metrics as a unit, we can identify true system health degradation. And if system health is what we care about, I claim that this methodology will give us a more accurate proxy with fewer false positives.
So my thesis is that we want to analyze logical groups of metrics instead of analysis on individual metrics. This approach has a few benefits:
- Eliminates the need to come up with a specific KPI to gauge the performance of the system (though having one is still a good idea)
- Gives us a better summary of the overall system state
- Enables us to interrogate the algorithm and understand the relative importance of each metric in a group for any point in time
- And if used for alerting, this will reduce false positives and minimize the alert fatigue issue
We were able to acquire a dataset from a real incident at Rainforest QA. They host several apps on Heroku but depend on AWS SQS for message passing, and the specific incident here was an SQS outage. They had over 7,000 metrics (across their Heroku logs, CloudWatch metrics, and custom metrics), which were stored in Librato. While they were alerting on too many messages being passed through SQS, they did not anticipate that there could be an outage, so they were not alerting on too few messages being passed through the service. Setting up an alert on all 7,000 metrics would not only require them to maintain the thresholds, but would ultimately lead to alert fatigue.
Thus, in the face of an incident, they were forced to dig through all their metrics before they uncovered the root cause. They had all the necessary data to diagnose the issue, but because there were so many metrics, it took them over 45 minutes to discover the problem.
So my task was to figure out if we could’ve done better. Here’s what we did.
Generating the metric groups
The first thing I did was sit down with the CTO to partition their 7,000+ metrics into logical groups that reflect different aspects of their system. He chose to partition the metrics into six groups, each representing a different app.
This step essentially gives me a specification of their system and enables me to understand which metrics should be analyzed as a group. In this case, the CTO is interested in understanding the health of each of their six apps.
With the system specification in place, we are now ready to start analyzing the metrics for each of the groups. Like all data projects, the first step involves data cleaning.
Upon digging through the data, I noticed two issues:
- There were missing values
- The metrics are collected in varying frequencies (AWS metrics in 1 min resolution, Heroku metrics in 30–60 min frequencies, custom metrics with no clear frequency).
Both of these issues were addressed with the same fix: impute each missing value with the last available value (a forward fill along time). So if a metric reported a value at minute 1 and then nothing until minute 4, minutes 2 and 3 would be filled with the minute-1 value.
Basically, you can think of our data as a matrix, where each row is a metric, and each column is the time step. What we’re doing in this phase is to take a sparse matrix and turn it into a dense matrix using a very simple imputation method.
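As a concrete sketch of this cleaning step (the metric names and values here are hypothetical, and I'm assuming pandas is available):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: each row is a metric, each column a 1-minute time step.
# NaN marks a minute where no new value arrived (a gap, or a slower-frequency metric).
raw = pd.DataFrame(
    [[5.0, np.nan, np.nan, 7.0, np.nan],     # e.g. a sparsely reported Heroku metric
     [120.0, 118.0, np.nan, 119.0, 121.0]],  # e.g. a CloudWatch metric with one gap
    index=["app.response_time", "sqs.messages_sent"],
)

# Impute every gap with the last available value (forward fill along the time axis),
# turning the sparse matrix into a dense one.
dense = raw.ffill(axis=1)
print(dense)
```

Note that a forward fill can't impute a gap before a metric's first observation; in practice you'd drop or back-fill those leading values.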
Next, we need to perform some pre-processing on the data. The metrics represent different things and may have different units, so we need to standardize them so they’re all on the same scale. One way to do that is to subtract off each metric’s mean and divide by its standard deviation, essentially centering each metric around 0 and converting everything to a unit of standard deviation.
One important thing to remember is that the mean and standard deviation computed on the training data must also be applied to the testing data, so you have to save these parameters.
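A minimal sketch of this standardization using scikit-learn's StandardScaler (the data here is randomly generated for illustration; rows are time steps and columns are metrics, since scikit-learn expects samples in rows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 2 metrics with very different scales and units.
rng = np.random.default_rng(0)
train = rng.normal(loc=[100.0, 0.5], scale=[10.0, 0.1], size=(500, 2))
test = rng.normal(loc=[100.0, 0.5], scale=[10.0, 0.1], size=(100, 2))

# Fit the per-metric mean and standard deviation on the training data only...
scaler = StandardScaler().fit(train)
train_z = scaler.transform(train)

# ...then reuse those saved parameters on the test data (never re-fit on test).
test_z = scaler.transform(test)
```

The `scaler` object holds the saved parameters (`mean_`, `scale_`), which is exactly the bookkeeping the paragraph above warns about.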
This approach works fairly well, but can lead to issues sometimes because the metrics themselves are not normally distributed. This itself is another blog post, so I won’t cover that today.
The algorithm — supervised or unsupervised?
In this step, we need to figure out which algorithm to use. We can’t go with a supervised algorithm because we don’t have enough labels. Thus unsupervised learning is our best choice. We’re still interested in anomaly detection, but we’re going to use a multivariate approach instead.
The data was split into training and testing sets, where the testing data included the incident and the training data did not. The process described below was repeated for each group of metrics. I use the word “system” to describe a group of metrics, and each group had its own training and testing data.
The way I decided to tackle the problem was by clustering the training data using k-means. Since the training data didn’t include the incident, you can interpret the result of the clustering as a bounding box around the data that captures “normal” system behavior.
Each feature vector is a high dimensional vector of metric values for some point in time, so basically it represents a state vector. This vector represents the behavior of the system at that point in time. Thus, you can think of the matrix of all of the training data as having captured the behavior of the system in its entire “normal” state. And by clustering the data, we’re essentially modeling the “normal” system behavior.
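A sketch of the clustering step, assuming scikit-learn and randomly generated stand-in data (the real input would be the standardized state vectors for one metric group, and the choice of k=8 here is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in training matrix: each row is a state vector, i.e. all standardized
# metric values in this group at one time step.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 50))  # 1000 minutes x 50 metrics

# Cluster the normal states; the centroids form a compact summary
# ("bounding box") of normal system behavior.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_train)
print(kmeans.cluster_centers_.shape)
```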
One point I’ll make is that because we’ve taken a multivariate approach, each feature vector will also capture relationships among the metrics themselves.
Once we’ve clustered the data, we need a way to decide whether a given state vector is anomalous. We tackled that by measuring the distance to the closest centroid: cluster the training data, then for each state vector in the test data, measure its distance to the closest centroid of the trained model.
The intuition here is that if a state vector is close to the bounding box, then it represents a typical system state. However, if it’s far from our bounding box, something about the state vector is off. The distance function we used to compute this metric is the Euclidean distance.
By applying this distance function to all the state vectors in our training/testing data, we now get a new time series metric that essentially summarizes all the underlying metrics for every point in time — think of this as the health score. The higher the health score, the more likely that the underlying metrics are degrading, and hence something might be wrong.
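Continuing the sketch: scikit-learn's `KMeans.transform` returns the Euclidean distance from each state vector to every centroid, so the health score at each time step is simply the minimum over centroids. The data here is synthetic, with an artificial "incident" injected into part of the test window:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 20))  # normal behavior only
X_test = rng.normal(size=(100, 20))
X_test[60:70] += 8.0                  # inject a synthetic incident window

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)

# Distance from each test state vector to its closest centroid = health score,
# one value per time step.
health_score = kmeans.transform(X_test).min(axis=1)
```

During the injected incident the state vectors sit far outside the "bounding box", so the health score spikes well above its baseline.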
Once we’ve got a way of quantifying the health score of a given system, we may want to understand why the score is what it is. To answer this question, we can interrogate the algorithm. By looking at the feature vector at some point in time and understanding how each metric is contributing to the overall health score, we can formulate a ranking among the metrics at that point in time.
This approach quickly organizes the underlying metrics for a group at a given point in time, placing the most important metrics near the top (because they contributed most to the overall health score) and the least important near the bottom (because they contributed least).
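The post doesn't spell out the exact attribution method, but one natural choice (an assumption on my part) follows directly from the distance function: the squared Euclidean distance decomposes into a sum of per-metric terms, so each metric's contribution is its squared deviation from the closest centroid. A toy sketch with hypothetical metric names:

```python
import numpy as np

metric_names = np.array(["sqs.messages_sent", "web.latency_p95", "db.cpu"])
state = np.array([9.0, 1.2, 0.8])             # standardized values at one time step
closest_centroid = np.array([0.1, 1.0, 0.9])  # nearest centroid from the trained model

# Squared distance = sum over metrics of (state - centroid)^2, so each term
# is that metric's share of the overall health score.
contrib = (state - closest_centroid) ** 2
ranking = metric_names[np.argsort(contrib)[::-1]]
print(ranking)  # the SQS metric dominates the score in this toy example
```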
By looking at the health scores, we can see that there was a massive spike around the same time in all the plots. If we pick a point on the spike around time step 300 in the “rainforest” plot, we see a ranking of the underlying metrics for that group.
Looking at that ranking, we can see several references to “sqs” metrics. In a matter of minutes, the Rainforest team would have known that something was going on with SQS. Note that we are not telling them that SQS is broken; we’re simply helping them sift through their data faster.
So while it took them over 45 minutes to dig through their data and find the problem, we would have been able to give them the right clues within 4–5 minutes of the start of the incident.
The intent of this post was to communicate the framework we used to think about the problem of root cause analysis. While the approach is fairly straightforward, we had to do quite a bit of hacking/tweaking to get everything working. Additionally, the algorithm itself has limitations. First, k-means is not capable of modeling temporal relationships. Additionally, k-means starts to break down in high dimensions. A model that can correct for these issues will give better results.
If there’s interest, I’m happy to write another post about some of these details and our attempts at solving them.
Let us know what you think, thanks!
If you’re interested in having us analyze one of your prior incidents, feel free to reach out.