Our challenges in building a “useful” anomaly detection system

Upal Hasan
Overseer Engineering Blog
Nov 17, 2017 · 9 min read


Last year I wrote an article on how we were leveraging machine learning at Overseer Labs to help our customers perform faster root cause analysis.

Since then, we have been fortunate enough to get our tech deployed more widely. Consequently, we have been able to stress test our algorithms, uncover blind spots, and make refinements. Given all that we learned, I wanted to write a follow-up post to highlight some of the challenges that we faced.

The Goal

Overseer’s objective is to help engineers sift through their data and surface the most critical metrics during an incident.

As detailed in my earlier post, we leveraged a top-down approach: we first divided a customer’s metrics to reflect the different components of their system, then developed a synthetic health score for each component, and finally, during an incident, used the health scores to figure out which subset of metrics to dig into. This empowered an engineer to sift through their data and arrive at a root cause faster.

We leveraged a multivariate anomaly detection algorithm to solve the problem. However, we ran into a lot of issues before converging on something that was “useful” for the engineer. I’m using the word “useful” to refer to a system that surfaces actionable insights from the user’s perspective.
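To make that formulation concrete, here is a minimal sketch of a per-component health score, assuming a simple baseline of per-metric means and standard deviations learned from training data. This is an illustration of the idea, not our exact algorithm, and the function names are hypothetical.

```python
import numpy as np

def fit_component_model(train: np.ndarray):
    """Baseline statistics for one component's metrics: train is (n_samples, n_metrics)."""
    return train.mean(axis=0), train.std(axis=0) + 1e-9

def component_health(live_vec: np.ndarray, model) -> float:
    """Distance of the current feature vector from the component's baseline."""
    mu, std = model
    return float(np.linalg.norm((live_vec - mu) / std))

# During an incident, rank components by health score and dig into the worst one's metrics:
# scores = {name: component_health(latest[name], models[name]) for name in components}
```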

To highlight some of our key challenges, I felt it might be useful to categorize them in the following manner: imperfect modeling, problem formulation assumptions, numerical issues, and data distribution shifts.

Imperfect Modeling

First, we quickly learned that operational metrics can vary quite a bit. Some are continuous, while others exhibit step functions. Some metrics are sparse (mostly zeros) or have lots of missing values (the metric value is not emitted by the code). Some are periodic, while others are wildly erratic. And the list goes on!

This observation indicated to me that this would not be a trivial modeling problem. Each “kind” of metric has its own set of patterns, and to nail the modeling problem, we’d need to model each “kind” separately. However, there’s no way to know in advance how many different “kinds” of metrics exist in the universe of all metrics. Additionally, we wanted models that were easy to understand and debug, with fast training and evaluation times. Thus, we had to accept that it wouldn’t be possible to build a perfect model. The best we could do was focus on building a system that demonstrated value for customers and leverage their feedback to surface the relevant modeling flaws that required attention!

Given the multivariate problem formulation, here are some of the issues we ran into.

Spiky health scores:

For our models, a feature vector consists of a high-dimensional vector of metrics at a given point in time. However, we realized that as the number of metrics in the feature vector increased, slight perturbations in many of the metrics were causing the health score to spike. Individually these perturbations were small, but in aggregate the impact was large. This was due to the distance calculation (e.g. the L2 norm), which summed over the squared perturbations and gave us a large health score.

To mitigate this issue, we had to apply a log transformation to the health score to ensure its values were within a reasonable range.
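As a rough sketch of why the raw score blows up and how the log transform tames it (assuming the L2-style distance described above; this is illustrative, not our production code):

```python
import numpy as np

def damped_health_score(live_vec: np.ndarray, mu: np.ndarray, std: np.ndarray) -> float:
    """The raw L2 distance sums squared per-metric deviations, so many tiny
    perturbations add up as dimensionality grows; log1p compresses the result
    back into a reasonable range."""
    raw = np.linalg.norm((live_vec - mu) / std)
    return float(np.log1p(raw))
```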

Non-interesting changes in the data:

The log transformation helped dampen the health score, but we uncovered a new issue. The algorithm was catching legitimate changes in the data, but the results weren’t interesting to users. For example, this would happen if there was a random spike in the metrics. Operational metrics are generally noisy, so this is not unexpected, but flagging every single one of these spikes made it difficult for our users to leverage the tool during an incident. From a mathematical perspective, the algorithm was doing the right thing. However, from a user’s perspective, this was a false signal.

To solve this problem, we decided to leverage dimensionality reduction techniques to remove some of the noise as a pre-processing step. This step helped to remove these random blips, which then enabled the algorithm to focus on identifying interesting changes (e.g. lots of metrics changing a lot simultaneously). Additionally, this step also helped mitigate the curse of dimensionality problem.
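Here is a sketch of that kind of pre-processing, using PCA as a stand-in for the dimensionality reduction step (the variance threshold is illustrative, and we are not claiming PCA is the exact technique we shipped):

```python
import numpy as np
from sklearn.decomposition import PCA

def denoise(train: np.ndarray, live: np.ndarray, var_kept: float = 0.9) -> np.ndarray:
    """Project onto the top principal components learned from training data,
    discarding the low-variance directions that mostly carry per-metric noise."""
    pca = PCA(n_components=var_kept)   # keep enough components to explain ~90% of variance
    pca.fit(train)                     # train is (n_samples, n_metrics)
    return pca.transform(live)         # reduced feature vectors used for scoring
```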

Erratic metrics:

Another issue that really impacted the health score was erratic metrics (e.g. memory metrics). These metrics had periods of stability followed by periods of sharp, discontinuous changes. These sharp changes were atypical enough that they single-handedly caused the health score to spike.

One way we solved this problem was to create a derived feature (e.g. the derivative) and use those derived values in our feature vector. That actually introduced a new problem, but mitigated this specific issue.
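A minimal sketch of the derived-feature idea, using a first difference as a stand-in for the derivative:

```python
import numpy as np

def to_rate_of_change(series: np.ndarray) -> np.ndarray:
    """Replace an erratic level metric (e.g. memory usage) with its step-to-step change,
    so long stable plateaus contribute ~0 and only the jumps themselves stand out."""
    return np.diff(series, prepend=series[0])
```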

Problem Formulation Assumptions

In solving the problem using an unsupervised anomaly detection technique, we ran into a number of challenges.

Lack of ground truth:

We built our models using a limited set of incidents, so there was always a question of whether or not the algorithms worked. We also learned that it was difficult to get engineers to annotate data in real time, which made it hard for us to rapidly iterate on the algorithms. To make things worse, if the algorithms didn’t demonstrate at least some value from day one, the engineers lost confidence and stopped using the tool.

There was no easy fix to this problem except to test the algorithm across a wide variety of data sets from different companies. This gave us confidence that the algorithm worked and provided immediate value to our customers. Once they started trusting the tool and using it more regularly, it was easier to ask engineers to annotate the data that resulted in incorrect predictions. We then leveraged this ground truth to uncover modeling flaws and refine our techniques.

Dirty training data:

Another issue with anomaly detection techniques is that the models are only as good as the data they were trained on. We ran into issues where the data had short but massive “blips” (or real incidents) in it. Initially we didn’t expect this to be a problem, since the vast majority of the data exhibited “typical” behavior and only a small percentage was dirty. However, we learned that these true anomalies/incidents in the training data were enough to skew the internal model parameters. The result was suboptimal rankings of the metrics.

To mitigate this problem, we basically had to algorithmically clean the data to remove these blips and incidents. Once these blips and incidents were removed, the data was cleaner, and the internal model parameters were more reasonable. This topic is itself another blog post, but if there’s enough interest, I can write about it.
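The cleaning itself deserves its own post, but a crude sketch of the idea is to drop training rows that are themselves extreme, for example using a robust (median/MAD) z-score so the blips don’t distort the scale estimate. The threshold here is illustrative:

```python
import numpy as np

def drop_blips(train: np.ndarray, z_thresh: float = 6.0) -> np.ndarray:
    """Remove rows whose robust z-score is extreme in any metric.

    Median/MAD are used so the blips themselves don't inflate the scale estimate."""
    med = np.median(train, axis=0)
    mad = np.median(np.abs(train - med), axis=0) + 1e-9
    z = np.abs(train - med) / (1.4826 * mad)
    keep = (z < z_thresh).all(axis=1)
    return train[keep]
```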

Numerical Issues

There were some metrics that were relatively constant for most of the training period. In one part of the ML pipeline, we standardized the data (i.e. subtracted the mean and divided by the standard deviation). The intent was to get all the metrics into a common unit so that we could perform an apples-to-apples comparison across them.

Where we ran into problems was that these metrics had a tiny standard deviation. Thus, if such a metric had a small (uninteresting) change in real time, its standardized value became huge because it was being divided by a very small number. From a mathematical perspective, this made sense because the metric appeared to have changed a lot relative to the training data. However, from a user’s perspective, this incorrectly caused the health score to spike and ranked those metrics very highly.

To mitigate this problem we had to identify those metrics and override their standardization parameters with a more reasonable value.
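A sketch of the standardization step with a floor on the per-metric scale; the floor value is a hypothetical knob, not our production setting:

```python
import numpy as np

def standardize(train: np.ndarray, live: np.ndarray, min_std: float = 1e-3) -> np.ndarray:
    """Standardize live data using training statistics, flooring tiny standard
    deviations so near-constant metrics can't blow up into huge z-scores."""
    mu = train.mean(axis=0)
    std = np.maximum(train.std(axis=0), min_std)
    return (live - mu) / std
```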

Shifts in data distribution

In the world of microservices, the application is changing dynamically, and that change gets reflected in the behavior of the metrics. Consequently, this opened up a whole new set of challenges for us.

Expected changes in metrics:

There were situations where a change occurred in the system but was a relatively rare scenario, such that the behavior was not reflected in the training data. Additionally, the change in the metric was sustained.

An example is an increase in the total physical memory of an instance. The relevant metric will exhibit a step function (value X for some long period of time, followed by value Y for some new range of time). When the change from X to Y is significant enough, it will skew the mathematical calculations and cause the health score to spike.

This could be interesting for an SRE; however, the issue is that the change in the relevant metric is sustained. Thus, the health score spikes up and stays up until the algorithm is re-trained (with enough new data) to reflect the change. This made it difficult for our customers to utilize the tool effectively because it incorrectly indicated that something was wrong with the service when in reality nothing was, especially if the change was expected.

To deal with this problem, we had to provide a way for the operator to acknowledge a drastic change in a specific metric. This feedback “mutes” that metric from contributing to the health score until the algorithm is re-trained with new data some time in the future. To do that, we basically had to implement a no-op that tells the algorithm to ignore certain metrics in the feature vector.

As a result, the following would happen:

  1. Expected change occurs and the health score spikes.
  2. The SRE acknowledges the change and tells the algorithm to exclude the relevant metrics, which reflect the system change, from the health score calculation.
  3. The health score will regress back to “normal” territory and the operator will be able to continue using the tool to diagnose other incidents.
  4. Once the algorithm is re-trained using data that reflects this system change, the metrics will get “un-muted.”
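A minimal sketch of that muting mechanism, implemented here as a mask applied to the feature vector before scoring (the metric name in the usage comment is hypothetical):

```python
import numpy as np

def masked_health_score(x, baseline, muted: set, metric_names: list) -> float:
    """Zero out acknowledged metrics so a known, expected shift can't dominate the score."""
    mask = np.array([name not in muted for name in metric_names], dtype=float)
    d = (np.asarray(x) - np.asarray(baseline)) * mask
    return float(np.log1p(np.linalg.norm(d)))

# Usage after the SRE acknowledges the memory change:
# score = masked_health_score(live_vec, baseline_vec,
#                             muted={"instance.memory.total"}, metric_names=names)
```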

Model decay:

Another issue we ran into was that the real-time data was changing too fast and the algorithm couldn’t adapt quickly enough. Initially we were using online learning, but ran into some limitations because the algorithms weren’t decaying old data. Thus, training data from long ago had the same weight as recent data.

To appreciate this problem, consider the scenario where you’re updating the average of a metric in real-time. The update rule would be to multiply the old average by the old number of samples, add the new metric value, and divide by the old number of samples + 1. This gives you the updated average.

Now where this becomes problematic is if the old sample count is never reset. In an online learning setting, this value only grows. And what’s the impact? As more samples are seen, it takes longer for the updated mean to reflect the new “true” mean of the data (because the large denominator makes it harder for the old average to change).
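In code, the plain running-mean update above, and an exponentially decayed alternative that forgets old data, look roughly like this (the decay factor is illustrative):

```python
def running_mean(old_mean: float, n: int, x: float):
    """Plain online mean: every historical sample keeps full weight,
    so the larger n gets, the slower the mean tracks a shift in the data."""
    new_mean = (old_mean * n + x) / (n + 1)
    return new_mean, n + 1

def decayed_mean(old_mean: float, x: float, alpha: float = 0.05) -> float:
    """Exponentially weighted mean: old data is gradually forgotten, so the
    estimate adapts at a rate set by alpha regardless of history length."""
    return (1 - alpha) * old_mean + alpha * x
```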

To mitigate this problem, we considered two approaches:

  1. Building our own streaming algorithms to decay old data.
  2. Regular re-training on the most recent data.

We were tempted to pursue (1), but decided (2) would work best for us at the moment.

New normal behavior/new metrics/obsolete metrics:

Aside from the model decay issue, which resulted from new normal behavior of the system, another reason we chose to re-train our models regularly was that the environment was changing too rapidly: new metrics were getting added to the system, and old metrics were becoming obsolete and being removed.

While better streaming algorithms would solve the model decay issue, they would fail to address the new-metrics/obsolete-metrics issue in our current problem formulation. However, regular re-training would solve all three problems, so we went with that. This decision introduced other problems, but that’s for another post!

Symmetry between training and testing data:

One of the things Overseer will do is pull data from your existing monitoring tools (e.g. Cloudwatch, Wavefront, Datadog), crunch on it to build models, and then evaluate your real-time data against those models to surface relevant data during an incident.

An issue we ran into was that the data provider was giving us historical training data at one granularity (e.g. metric values averaged over 2 minutes), but real-time data at a very different granularity. Since the granularities were so different, they produced different model parameters and hence very different feature vectors. Mixing the two granularities threw off the calculations, and the health score spiked.

The fix here was simple: we had to ensure that there was symmetry between the training data and real-time data!
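A sketch of enforcing that symmetry by resampling both streams to a common granularity before any modeling, assuming timestamp-indexed pandas frames; the 2-minute bucket mirrors the example above:

```python
import pandas as pd

def align_granularity(df: pd.DataFrame, freq: str = "2min") -> pd.DataFrame:
    """Bucket a timestamp-indexed metrics frame to a fixed granularity so that
    training and real-time data are aggregated the same way before modeling."""
    return df.resample(freq).mean()

# Apply the same transformation to both historical and live pulls:
# train = align_granularity(historical_df)
# live  = align_granularity(realtime_df)
```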

Conclusion

The intent of this post was to articulate some of the challenges we faced trying to build an anomaly detection system that would be “useful” for our customers. As you can see, it wasn’t as simple as pumping data into some algorithm that magically worked on the first try.

The key issue is that we were trying to model something that was very difficult to model (i.e. non-Gaussian data) while constraining ourselves to models that were easy to understand and debug, with fast training and evaluation times. As a result, there were a lot of details and edge cases that needed to be addressed, and the lack of ground truth only amplified the problem. It was only through patience and real-time feedback about expected behavior that we were able to track down these issues and build something of value.

If there’s interest, I’m happy to continue writing about additional challenges that we face in the future.

Let me know what you guys think, thanks!

Upal Hasan

If any of you guys feel Overseer’s technology could help the SREs at your job, we’d love to help! Just reach out to me at upal@overseerlabs.io.
