Fears and favorites from 100+ DevOps/SRE surveys

The Survey

DevOps and Site Reliability Engineering, despite their near-decade-long tenures, are still being defined on a day-to-day basis as practitioners work to improve and support their organizations. One definition that will always be growing is what it means to be a practitioner of DevOps/SRE. What are the roles and responsibilities? What are the main concerns/joys? What do the organizations/codebases look like for people doing DevOps? And finally, what does the landscape of tooling for DevOps look like at present?

To help add to the growing definition of what being a DevOps/SRE engineer means, I put together a survey and posted it to Reddit (/r/programming and /r/devops) and Hacker News in hopes of sniping some passers-by with the promise of an interesting survey. Just over 200 people responded, resulting in lots of interesting data on the life of engineers doing DevOps/SRE. In the previous post, I covered responses to questions about the respondents’ roles and the makeup of their teams. In this post, we’ll really get into the heads of some engineers with questions on joys and fears. I plan to follow this up with a post or two discussing their codebases/infrastructure, and finally, their toolchains.

Fear the unknown! (and the undefined)

Now we start getting to the good stuff: hopes and fears. We’ll start with fear so we can end on a high note 😊. Before we get there though, it’s worth noting that at this point in the survey, the number of respondents dropped off sharply. This question was right after a “next page” button on the survey, so presumably people either (a) didn’t see the “next” button (unlikely), or (b) decided they’d rather get back to sweet-talking Jenkins into managing their personal laundry pipeline than continue taking an online survey (can’t say I blame them!).

The “fear” related question I asked participants was:

In addition to the build pipeline, other options included:

-There might be customer-affecting performance issues we’re not catching
-Our mean time to recovery (MTTR) is painfully high
-Our DevOps team is having trouble scaling with demand
-False positives create so much noise that relevant alerts are lost
-DevOps is not really respected at my organization
-I have to wear too many hats at my current organization
-Other

If they selected an option for “other”, they were asked to describe the concern in a free-form text box below (you can see those responses here).

Treating “Pretty much impossible” as 0 and “Constant worry” as 3, the results came in as:

0 equates to “Pretty much impossible” and 3 to “Constant worry”. These error bars represent the spread in the data, NOT the uncertainty in the average value. About 100 responses went into each of these, except for “other”, which got about 45 responses. MTTR=Mean Time To Recovery, CD=Continuous Delivery

Whoa there! Those are some big error bars. It might be worth bringing some of the stats talk out of the captions and into the prime time to explain what that means. The error bars in the above chart represent the standard deviation, which characterizes the spread in the responses. So this large standard deviation indicates that a lot of people had very different opinions on what concerned them, which is interesting evidence of how diverse the field is. However, this is not to be confused with the standard error, which characterizes how uncertain we are about what the average answer was (i.e. a large standard error would mean that if we had a lot more people take the survey, the average might change by a large amount) </stats lesson>. If we look at the same chart, but use the bars to represent how much the results might have been different with more people, it would look like:

Same as above, but with error bars representing standard error rather than standard deviation

Now we have a clear picture. Though any two people may differ a lot on how concerned they are about these issues, there are some definite differences between the concerns on average.
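
For anyone who wants to see the standard deviation vs. standard error distinction in numbers, here is a minimal Python sketch. The scores are made up for illustration (they are NOT the survey data), and the function name is my own invention. It just shows that with about 100 answers on the 0 to 3 scale, the spread between individuals can be large while the uncertainty in the average stays small.

```python
import math
import statistics


def spread_and_uncertainty(scores):
    """Return (mean, standard deviation, standard error) for a list of scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation: how much individual answers differ from each other
    sd = statistics.stdev(scores)
    # Standard error of the mean: how uncertain we are about the average itself
    se = sd / math.sqrt(n)
    return mean, sd, se


# Hypothetical scores, NOT the survey data: 100 answers spread evenly over 0..3
hypothetical_scores = [0, 1, 2, 3] * 25

mean, sd, se = spread_and_uncertainty(hypothetical_scores)
print(f"mean={mean:.2f}  sd={sd:.2f}  se={se:.2f}")
```

Because the standard error shrinks with the square root of the number of answers, 100 responses give error bars on the average that are ten times narrower than the spread between individual people.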

Code cockroaches

The most concerning issue to survey takers was that there might be customer-affecting performance issues they’re not catching. If you have a system where this is possible, it’s understandable that this would be very concerning to you. If you don’t know something is happening, your hands are tied (or really, your eyes blindfolded) when it comes to fixing it. This reminds me somewhat of the Codeless Code post about The Three Most Terrifying Words: “possible race condition.” If there’s a cockroach in my house, I’d rather know it’s there (and preferably where it’s hiding)!

If you, dear reader, are one of the people who share a concern for unknown unknowns in your system’s behavior, I’d be remiss not to mention Overseer Labs’ efforts in this area. To avoid going into full-on sales mode, I’ll just suggest you take a look at this case study where we at Overseer helped a large streaming company address just this type of issue. </sales mode>

This might be a problem! Hey, this might be a problem. Yo, did you know this might be a problem?

Following behind this were two concerns that were basically tied for second. The first of these was the fear that alerts representing non-issues (false alarms) would drown out real problems. By the way, Overseer helps with that too! </sales mode FOR REAL this time>. This is, of course, related to the top concern: if you’re missing relevant alerts, you are almost by definition missing production problems. Engineer Luis Canales (one of the generous respondents who left their email address) stated concisely what this issue looks like in practice: “Our shop is drowning in alerts, and I feel like most aren’t even actionable. One of my major concerns is getting so many alerts that we’ll lose sight of those that actually are important, true, and actionable.”

“Our shop is drowning in alerts”

One hat, two hat, red hat, blue hat

Tied with worries about false alarms was the fear that “I have to wear too many hats at my current organization.” The road to DevOps zen appears to be a maze with many false turns. One of these dark alleys appears when “ops should be integrated more with dev” turns into “ops should be responsible for fixing EVERYTHING.” This can leave some people feeling that their job is ill-defined or that expectations are too high. One condition that can lead to this kind of culture is when the company is small (and thus there are plenty of hats lying around that have to get dumped on somebody’s head). The problem is not isolated to startups though: large organizations often have their own variety of tech-colored hats to go around. One respondent put it this way in a follow-up: “‘DevOps’ is simply expected to be the problem solvers for the business in a smaller scale setting. It doesn’t matter if it’s fixing the printer, doing the VPN, operating the global scale cloud infrastructure or writing custom code and interacting with stakeholders or writing C*O level financial or operational reports. At the enterprise level, embedding a DevOps role into an existing team is also similar in that they are expected to be the general IT Problem solvers for the individual business unit.”

When you wish upon a survey

What’s halfway between a concern and a love? A wish, apparently. One of the questions was free form and very open ended:

As an SRE/DevOps person, I wish…

To get something quantifiable out of this free form text, I read through people’s responses and tried to categorize each one into a bucket relating to the common themes that showed up. Yes, this is certainly subjective, so you can check out how I categorized the responses if you want to see things without relying on my interpretation.

Wishes broken up into general categories after the fact. Grayscale values are technical, colorful ones are cultural. There were 55 respondents who answered this question. Standard errors are below each percentage. Real errors might be a little higher due to mistakes in my subjective categorizing.
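
A quick aside on where a standard error for a percentage can come from. One common choice (an assumption on my part for this sketch, not necessarily the exact calculation behind the chart) is the binomial standard error, sqrt(p(1-p)/n). The count below is made up purely for illustration, and the function name is hypothetical.

```python
import math


def proportion_standard_error(count, total):
    """Binomial standard error of the share count/total, as a fraction."""
    p = count / total
    return math.sqrt(p * (1 - p) / total)


# Hypothetical example, NOT a real survey count: 11 of 55 answers in one category
count, total = 11, 55
share = count / total
se = proportion_standard_error(count, total)
print(f"{share:.0%} ± {se:.1%}")
```

With only 55 answers, a category holding about a fifth of the responses carries an uncertainty of roughly five percentage points, which is why small differences between buckets shouldn't be over-read.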

I wish I could go into more detail on these answers and the follow up interview questions I had, but hey — you’re a busy person, so those are the highlights.

Dev ……. Ops…… Management: crossing some chasms

As you can see, the majority of things DevOps practitioners wish they could change about their organizations are cultural rather than technical. If you could wave a wand and give everyone at every company an understanding of what DevOps is and isn’t, foster more integration between Dev and Ops in practice rather than just theory, and improve communication between management and engineers, you’d have addressed 45% or more of the “wishes” above.

Luckily, as we saw in the large spread of responses for the “respect” portion of the concerns question, organizations have succeeded to varying degrees in making their DevOps/SRE employees feel valued, which is a useful proxy for how well they’ve handled understanding each other, communicating, and integrating. Engineer Randy Hoover discussed what it looks like to be in the early stages on the path to DevOps nirvana: “DevOps in our org has been a shoehorned operations team that is having to try and educate everyone from CEO down on what devops actually is and the goal we are trying to achieve. It’s been a real struggle…”

“[We’re] having to try and educate everyone from CEO down on what devops actually is and the goal we are trying to achieve.”

However, there are certainly those who have realized the dream, such as the interviewee who stated succinctly (and anonymously): “We seem to have a pretty symbiotic relationship with our implementors.” So how did the successful ones “get there?” Some, like respondent Matthew Walter’s organization, started there: “It’s just how things are done. Within the rest of the company there is such a drive and desire to be better that people are willing to do almost anything that will improve their outcomes. It makes collaboration as easy as explaining an idea and how it will help.” Others, like David Mark’s company, got there after years of effort: “[My recent employer] has a great appreciation for the DevOps way of running its technical operations and has embraced it at all levels. This is the result of a few years of focused emphasis on Site Reliability Engineering/DevOps best-practices. The processes and tools are mature, and allow the company to safely deploy multiple times per day.” When it works, it works.

“We seem to have a pretty symbiotic relationship with our implementors”

My favorite part: people’s favorite parts

Once I had given survey-takers a chance to talk about what they would change if they could, I gave them a chance to talk about what they were already enjoying about their jobs. The answer? A lot, apparently! Despite the concerns above, living the DevOps way is still bringing some warm fuzzies into the hearts of stereotypically cold techies. The exact question I asked was:

My favorite part of being an SRE/DevOps person is…

Again, I tried to find the common themes in people’s answers and then place the responses into categories (see exactly how I mapped the responses here). Here’s the result:

Standard errors shown, from 62 responses. Actual errors should probably be larger due to the subjectivity of breaking things into buckets

One of the two most common themes was that people take joy in that ever-present life-blood of anyone doing DevOps: automation. It’s no secret that automation is important to DevOps (heck, it’s right there in the middle of the manifesto). But who knew it could be so satisfying? I followed up with several people who mentioned automation gratification in their surveys to find out just what makes it so fun to automate things. My favorite answer comes from Brian Larsen: “I’ve been around since every server was setup by hand (and at best documented somewhere) and I’ve seen the wins you get from automating. I worked in companies where the loss of one server meant days or weeks working on reconstructing it, even with backups, to now where we frequently rebuild and replace a whole instance as a means of deploying new code, it’s an amazing change.”

For the other top favorite, the variety of the job, it seems appropriate to quote some of the various ways people expressed this in their survey responses:

  • “[My favorite part is] touching every part of the business”
  • “[My favorite part is] always learning and absorbing new technologies”
  • “[My favorite part is] broad involvement across the software lifecycle and ability to interact with many slices of technology and teams.”
  • “[My favorite part is] learning new s***”

Conclusion

I hope you learned some new… stuff! In the next post, we’ll take a look at the codebases/infra these engineers are automating. After that, we’ll see just how much variety there is in the DevOps world by looking at the toolchains people are using.

One of those people concerned about false alerts, or unknown cockroaches in your operational code? Check out how Overseer Labs can automate the discovery of the operational issues that really matter using the power of machine learning. Question about the article or want to chat? Shoot me a message at josh@overseerlabs.io
