Clouds, containers & microservices: infra and architecture from ~100 DevOps/SRE surveys

The survey

Once upon a time (aka 1968) there was a guy named Melvin Conway who ran an experiment in which two teams were asked to create compilers. One team was made up of 5 people and was asked to create a COBOL compiler; the other had 3 people and was asked to create an ALGOL compiler. The COBOL compiler from the 5-person team ran in 5 stages, and the ALGOL compiler ran in (yup, you guessed it) 3 stages. After this observation, Conway made a statement which has come to be known as Conway’s Law: “Organizations which design systems…are constrained to produce designs which are copies of the communication structures of these organizations… The larger an organization is, the less flexibility it has and the more pronounced the phenomenon.”

Fast-forward to today, and a movement has sprung up that internalizes this idea, with the goal of streamlining the pipeline that starts with a good idea and ends with running code delivering value to customers. Conway’s Law provides a helpful hint on how to do this: if you want to reduce the bottlenecks in your idea-to-running-code process, reduce the communication bottlenecks between the people involved at the various steps of that process. This movement (as you’ve probably guessed, my informed tech-blog reader) is known as DevOps. It has a cousin concept known as “Site Reliability Engineering,” which applies many of the engineering principles of software development to the ops environment.

If the structure of a company’s people determines the structure of its processes and architectures, it seems like we could learn a lot about the latter by asking the former. So… I did that 😊. I posted a plea on reddit (/r/programming and /r/devops) and Hacker News, asking practitioners of DevOps and Site Reliability Engineering to take a survey about themselves and the people they work with, their daily concerns and pleasures, the codebases they work with, and the tools they use. Around 200 generous people responded (though a little less than half of them finished), and I’ve been blogging about the results. I’ve covered the people and the fears/favorites in other posts — today we’ll dive into the infra and architectures.

Microservices, monoliths, and maybe dinosaurs

One key architecture question aimed to discover the structure of the beast the respondents are trying to tame. Specifically, I asked:

Which of the following most accurately describes your architecture?
- Monolith
- Microservices
- Neither (ex: software is desktop application or other)

“But wait!” some of you are exclaiming, “What about the path of moderation, between the monstrous monolith and the medusa of microservices? Why didn’t you give a choice for the sagacious ‘Service Oriented Architecture’?” In reply, I can only point to a sidebar on the martinfowler.com page on microservices, which argues that Service Oriented Architecture means too many different things to too many different people, and that “microservices” is really about the shape of the structure, not the number of services or the size of each. So to keep things clean, participants had to choose “microservices,” “monolith,” or “other.” The results? Microservices FTW!

High-level architectures from 89 responses. Standard errors shown below percentages.
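For the curious, the standard errors shown throughout these plots can be computed with the usual binomial formula for a proportion, SE = sqrt(p(1−p)/n). A minimal sketch in Python (the 54-of-89 split is an illustrative number, not an actual survey count):

```python
import math

def proportion_with_se(count, n):
    """Share of respondents picking an answer, plus its binomial standard error.

    Uses the normal approximation: SE = sqrt(p * (1 - p) / n).
    """
    p = count / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Illustrative numbers only (not actual survey counts): 54 of 89 respondents.
p, se = proportion_with_se(54, 89)
print(f"{p:.1%} ± {se:.1%}")
```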

This shouldn’t be too much of a surprise given all the buzz around microservices in recent years. I guess it just goes to show that microservices and DevOps are a match made in heaven (or at least in the cloud). Or, as one DZone article puts it: “DevOps and microservices both are pretty good things, but they really work best if applied together.”

While we’re on this question, let’s take a peek to see if we can find any evidence for Conway’s Law. Microservices involve breaking up your application into several distinct, independently runnable pieces, so Conway’s Law might suggest that they are more likely when you have several distinct, independently run teams. Luckily, one of the earlier questions in the survey asked respondents how many teams at their organization were dedicated to DevOps/site reliability. When we slice the above pie into two, one for organizations with one or zero DevOps/SRE teams and one for organizations with multiple DevOps/SRE teams, we see that having more teams is in fact associated with using microservices:

High-level architectures for respondents from organizations with fewer than 2 DevOps/SRE teams (left) and 2 or more DevOps/SRE teams (right). 58 answers went into the left plot, 29 into the right. Standard errors shown below percentages.
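That slice-and-compare is just a filter over the response rows. A minimal sketch with made-up rows and field names (the real data on GitLab uses its own schema):

```python
# Hypothetical survey rows; the real column names in the GitLab data may differ.
rows = [
    {"teams": 1, "arch": "Monolith"},
    {"teams": 0, "arch": "Monolith"},
    {"teams": 2, "arch": "Microservices"},
    {"teams": 3, "arch": "Microservices"},
    {"teams": 1, "arch": "Microservices"},
    {"teams": 2, "arch": "Monolith"},
]

def microservices_share(subset):
    """Fraction of a slice of respondents who answered 'Microservices'."""
    return sum(r["arch"] == "Microservices" for r in subset) / len(subset)

few = [r for r in rows if r["teams"] < 2]    # fewer than 2 DevOps/SRE teams
many = [r for r in rows if r["teams"] >= 2]  # 2 or more DevOps/SRE teams
print(f"<2 teams: {microservices_share(few):.0%}, "
      f">=2 teams: {microservices_share(many):.0%}")
```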

So what about those people who answered “other”? This bucket might contain people working with stand-alone apps, libraries, or perhaps even some dinosaur applications stuck in desktop-land (jk, I suppose desktop still has its place — even web devs aren’t ready to say desktop is dead). If you are working on a dinosaur desktop app with some SaaS-y competitors and are wondering who will win the market, I’m sorry to say that sometimes the future is easy to predict. Unsurprisingly, most survey participants are already providing their functionality through “X as a Service”:

How users access the survey-respondents’ applications. Exact question and answer wordings are shown. Standard errors are below the percentages.

Containers in the cloud

With some rough architecture info out of the way, let’s look at infrastructure. The first question I asked related to this topic was:

Which of these are leveraged by your application’s production environment?
- Cloud
- Containers
- Dedicated, on premise hardware
(Users could select multiple answers)

Infrastructure in use by the survey respondents’ applications. Note that respondents could choose multiple answers, which is why these add up to more than 100%. Standard errors shown.
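Since respondents could pick multiple answers, each option’s percentage is tallied against the full respondent count independently, which is why the bars can sum past 100%. A quick sketch of that tallying with invented responses:

```python
from collections import Counter

# Invented multi-select responses; each respondent picks one or more options.
responses = [
    {"Cloud", "Containers"},
    {"Cloud"},
    {"Cloud", "Dedicated, on premise hardware"},
    {"Containers"},
]

counts = Counter(option for picks in responses for option in picks)
n = len(responses)
# Each option's percentage is out of all n respondents, so totals can top 100%.
percentages = {option: 100 * c / n for option, c in counts.items()}
print(percentages)
```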

86% of respondents are in the cloud, which isn’t that surprising but is still a strikingly high number — heck, 86% could be Twitter’s uptime circa 2010 😉 (remember the fail whale, anyone?). 52% of survey takers are using containers, which is a reminder of how quickly things change in the software world; after all, Docker only entered the scene publicly about 4 years ago.

Though dedicated hardware made a solid showing, most (75%) of the people who indicated that they’re using dedicated hardware also indicated that they’re using cloud infra. Some of these cloud+hardware respondents may be indicating that they’re using a private cloud, but based on a question about cloud providers, which we’ll cover next time, it would seem most of these individuals are actually just using a hybrid cloud solution.
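That 75% figure is just the intersection of the two answer groups divided by the size of the dedicated-hardware group. A sketch with invented respondent IDs:

```python
# Invented respondent IDs for illustration, not actual survey data.
uses_hardware = {"r1", "r2", "r3", "r4"}
uses_cloud = {"r1", "r2", "r3", "r9", "r10"}

# Respondents who selected both "Dedicated, on premise hardware" and "Cloud".
overlap = uses_hardware & uses_cloud
share = len(overlap) / len(uses_hardware)
print(f"{share:.0%} of dedicated-hardware users also use cloud")  # 75%
```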

Counting at scale

The last pair of architecture/infrastructure questions asked roughly how many pieces of server/vm infrastructure are in use in the respondents’ applications, and how many metrics are being collected on those applications. Since counting to numbers with 4, 5, or more digits is rather boring and time-consuming, I basically asked respondents to guess at the order of magnitude of each value:

Roughly how many servers/vms are in your production environment?
- 1
- 2–9
- 10–99
- 100–999
- 1000–9999
- 10000+

Number of servers/vms in production. Standard errors shown below percentages.

How many performance/system health metrics are monitored for your codebase? (Ex: watching CPU usage and memory usage for 10 servers counts as 20 metrics)
- 0
- 1–9
- 10–99
- 100–999
- 1000–9999
- 10000+

Number of metrics being monitored. Standard errors shown below percentages.

If we map each of the size categories starting at 10 or higher to a panel of the xkcd below (Realistic? No. More fun to talk about than “orders of magnitude”? Absolutely.), it looks like the most common scale for respondents is the “throw away the whole room” size, or 100–999 servers.

https://xkcd.com/1737/

However, it’s worthwhile to note that we also have a reasonable representation across the range from “toss the machine” (10–99) to “let the datacenter burn to the ground” (10000+).
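Mapping a raw server count onto these order-of-magnitude bins boils down to counting decimal digits. A minimal sketch (the bucket labels mirror the answer choices above):

```python
# Bucket labels mirror the survey's answer choices.
BUCKETS = ["1", "2–9", "10–99", "100–999", "1000–9999", "10000+"]

def bucket(count):
    """Map a positive server/vm count onto its order-of-magnitude bin."""
    if count == 1:
        return "1"
    # For count >= 2, the decimal digit count picks the bin directly:
    # 1 digit -> "2–9", 2 digits -> "10–99", ..., capped at "10000+".
    digits = len(str(count))
    return BUCKETS[min(digits, len(BUCKETS) - 1)]

print(bucket(7), bucket(450), bucket(123456))
```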

When it comes to monitoring, it appears that more than half of respondents are tracking at least 1000 metrics. At that scale a human would certainly have trouble keeping an eye on everything, and some sort of automated alerting becomes a necessity. Even at the 1000-metric scale alert fatigue can start to set in, and by the 10000+ metric scale it’s a virtual inevitability unless you’re using the right tools.

Clouds, containers, counting: conclusion

Now that we’ve covered the teams and roles in DevOps, the daily concerns and joys, and the infrastructure and architecture, the only portion of the survey remaining to be covered is the tools of the DevOps and Site Reliability worlds.

By the way, if you’d like to play around with the survey results yourself, the raw results are available on GitLab, along with a Python script for analyzing the data (it’s what I’ve been using for the results shown in this blog).

Shameless plug: Does your system have tons of metrics? Are you suffering from alert fatigue, high incident resolution time, or lack of vision into subtle production issues? Do you want to be able to tell your friends your monitoring stack is better than theirs because it’s backed by machine learning? Then have I got a deal for you! For just 1 easy click, you can check out http://overseerlabs.io/ and be utterly amazed by its revelations. But wait! That’s not all! If you drop me a line at josh@overseerlabs.io, you can have all your questions answered COMPLETELY FREE.

Follow us on Twitter or here on Medium to catch the next post!