Doron Grinstein

How and Why You Should Automate DevOps

It’s almost impossible to do DevOps well unless you automate. But where do you start and where does it end?

A Guide to Automating DevOps

DevOps is the foundation of any engineering team, but it rarely gets the time or attention it needs. A DevOps engineer must not only manage how code is organized, stored, tested, compiled, and deployed but also how the resulting application is scaled, monitored, and optimized. It’s a lot of responsibility often spread across too few shoulders.

Unless your team can automate at least some part of your DevOps infrastructure, it’s unlikely you’ll be able to keep up with the demands of the job. This, in turn, makes it unlikely that your company’s application running on top of this infrastructure will remain available, scale to meet demand, stay secure, and offer consistently low latency regardless of the user’s location. 

Automation is essential, but it’s difficult to know where to start. So start here, and I’ll point you in the right direction.

What is DevOps Automation?

Let me first define the DevOps tasks available for automation. DevOps is an integrated process that involves the sum total of infrastructure necessary for the development, deployment, and proper care and feeding of your application. However, a key hinge point in this process is the distinction between the DevOps tasks required to move your code from the developer’s fingertips to a production environment and the tasks needed to manage the application once it’s deployed to customers. These two stages are often referred to as Day 1 Ops (getting your application deployed) and Day 2 Ops (managing your application in production).

Day 1 DevOps Tasks

Setting Up a Code Repository

Code written on a developer’s laptop must be centralized and organized in a secure code repository like GitHub, GitLab, Bitbucket, or the like, with strict controls on how new branches of each file merge into the main and procedures on how to revert to a working version should something go wrong. Most modern code repositories offer a fair amount of automation out of the box, and many offer functionality that extends deep into the CI/CD pipeline.
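Much of this can be configured programmatically through the repository’s API. As an illustration, here is a minimal sketch that enables branch protection on main via the GitHub REST API (the owner, repo, and status-check names are hypothetical, and the token is assumed to live in an environment variable):

```python
import os
import requests

# Hypothetical repository coordinates -- substitute your own.
OWNER, REPO, BRANCH = "acme", "storefront", "main"

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    json={
        # Require CI to pass before anything merges into main.
        "required_status_checks": {"strict": True, "contexts": ["ci/tests"]},
        "enforce_admins": True,
        # Require at least one approving review on every pull request.
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
    },
)
resp.raise_for_status()
```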

Testing

Some of your application’s functionality may be testable manually, but establishing test coverage over a wide area of your app’s functionality usually requires writing a fair amount of custom code and/or integrating one or more test automation solutions.
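If you write custom tests in Python, a framework like pytest keeps each check small and automatable. A minimal sketch (the pricing module and its apply_discount function are hypothetical stand-ins for your own code):

```python
# test_pricing.py -- run with `pytest`
from pricing import apply_discount  # hypothetical module under test

def test_discount_reduces_price():
    assert apply_discount(price=100.0, percent=10) == 90.0

def test_discount_never_goes_negative():
    assert apply_discount(price=10.0, percent=200) == 0.0
```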

Vulnerability Assessment

Most modern applications consist of custom application code compiled together with a long list of dependencies. Developers can inadvertently introduce security vulnerabilities into their apps through an unpatched dependency or a component with a known vulnerability. Vulnerability assessment involves the tools and processes used to discover a system’s dependencies and code changes and determine whether they are safe to include in the production app. It can also scan code for API keys, personal tokens, or other secrets that a developer might have used during development and pushed to production by mistake.
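Dedicated scanners like gitleaks or pip-audit do this far more thoroughly, but a toy sketch shows the idea behind secret scanning (the two patterns here are illustrative, not exhaustive):

```python
import re
from pathlib import Path

# Illustrative patterns only; real scanners ship hundreds of curated rules.
PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic API key": re.compile(
        r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{20,}['\"]"
    ),
}

def scan(root: str) -> list[tuple[str, str]]:
    """Walk a source tree and flag files matching any secret pattern."""
    findings = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                findings.append((str(path), name))
    return findings

if __name__ == "__main__":
    for path, kind in scan("."):
        print(f"possible {kind} in {path}")
```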

Constructing Environments using Infrastructure as Code (IaC)

Developers often write code in their local environment (on their laptop), but to test whether the code works in concert with the rest of the application, a DevOps engineer must construct environments in which the application can be developed, tested, and run. 

Usually, at a minimum, engineering teams need a development environment, a staging (or “testing”) environment, and a final production (or “live”) environment where customers can use the application. Each of these environments must not only provision the application with computing resources (CPU, RAM, etc.) in one or more regions but also enable the application to access whatever backing services (RDS, S3, BigQuery, AD, etc.) the application’s workloads require. Most environments will also need a few utilities like NAT gateways and load balancers to route and distribute requests properly.

Solutions like Terraform and Pulumi enable DevOps engineers to stand up and tear down environments in a declarative, repeatable, and deterministic fashion. You could do the same things using manual processes, but Infrastructure as Code (IaC) does a lot of the work for you and provides a definitive record of the current state of your infrastructure. Furthermore, IaC keeps track of the dependency graph of your infrastructure, determining which parts need to be created in what sequence, so you’re not (for instance) creating an EKS cluster before the load balancer it depends upon is constructed.
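Pulumi, for example, lets you express this directly in Python, and the implicit reference from subnet to VPC below is exactly what builds that dependency graph. A minimal sketch (resource names and CIDR blocks are illustrative; run it with pulumi up inside a Pulumi project):

```python
import pulumi
import pulumi_aws as aws

# Declare the VPC first; Pulumi derives creation order from references.
vpc = aws.ec2.Vpc("app-vpc", cidr_block="10.0.0.0/16")

subnet = aws.ec2.Subnet(
    "app-subnet",
    vpc_id=vpc.id,              # implicit dependency: subnet waits for the VPC
    cidr_block="10.0.1.0/24",
)

pulumi.export("vpc_id", vpc.id)
```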

Managing Data Migration

Your application’s data differs depending on which environment the application is running in. In development, the application might not need much data, but in staging, you’re probably going to want a sufficient dataset to simulate issues that a user might encounter in production. 

You won’t want to simply duplicate production data containing user PII into staging or development, nor overwrite production data with dummy data. DevOps must establish systems and processes for migrating data from one environment to another. If you’re just getting started, you may be able to get away with migrating manually. As your datasets grow, you’ll have to find a way to migrate the data and schema automatically, whether through your own code or an off-the-shelf solution.
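As a rough illustration, here is a minimal sketch of a masked copy from a “production” database into staging, using SQLite so the example is self-contained (a real pipeline would pair a migration tool like Alembic or Flyway with a proper anonymization step; the schema is illustrative):

```python
import sqlite3

# Connect to an illustrative "production" and staging database.
prod = sqlite3.connect("prod.db")
prod.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, email TEXT)")

staging = sqlite3.connect("staging.db")
staging.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, email TEXT)")

# Copy rows across, masking the PII column on the way.
for user_id, _real_email in prod.execute("SELECT id, email FROM users"):
    staging.execute(
        "INSERT INTO users VALUES (?, ?)",
        (user_id, f"user{user_id}@example.com"),  # masked address
    )
staging.commit()
```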

Continuous Integration / Continuous Deployment (CI/CD)

The CI/CD pipeline detects when a change has been introduced into your code repository, automatically subjects the change to a battery of tests, runs the code’s dependencies through a vulnerability assessment, and deploys the code to the pipeline’s target environment – whether development, staging, or production.

In Day 1 ops, the CI/CD pipeline forms the backbone of most DevOps automation efforts. Its purpose is to pull all the code, dependencies, and tests required for a successful deployment into one automatic process. Without this process, developers must build new versions manually, which makes testing new code far more time-consuming. Many code repositories offer functionality for building and automating the CI/CD pipeline, as do the larger cloud providers like AWS, GCP, and Azure, and platforms like Heroku.
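Stripped of any particular vendor, the skeleton of a pipeline is just an ordered list of stages that must all succeed. A hedged sketch (the deploy script, image tag, and tool choices are hypothetical; real pipelines express these stages as YAML in GitHub Actions, GitLab CI, and the like):

```python
import subprocess

# Illustrative pipeline stages, run in order on every push.
STAGES = [
    ["pytest", "--quiet"],                         # run the test battery
    ["pip-audit"],                                 # scan dependencies for CVEs
    ["docker", "build", "-t", "app:latest", "."],  # build the artifact
    ["./deploy.sh", "staging"],                    # hypothetical deploy script
]

for stage in STAGES:
    print(f"--> {' '.join(stage)}")
    subprocess.run(stage, check=True)  # abort the pipeline on any failure
```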

Day 2 DevOps Tasks

Scaling and Cost Optimization

Your application might experience consistent, linear growth in user requests over time. More likely, however, you’ll need to plan how your application scales up and down from periods of low utilization, through spikes, and back down again without users noticing any difference in the application’s performance.

The type and degree of scaling automation depend highly on your computing infrastructure. If you’re using AWS Lambda, for instance, scaling takes place beneath the surface, but if you’re developing on a DigitalOcean “droplet” (a VM), you may have to pick the droplet size that matches your peak usage even if it’s overkill during periods of low usage. On the other hand, autoscaling a Kubernetes cluster in EKS or GKE involves setting the minimum and maximum number of pods (or containers) used by your application as requests to the application fluctuate. Kubernetes handles spinning up and down replicas of your workload.
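With the official Kubernetes Python client, that configuration is a few lines. A minimal sketch, assuming a Deployment named web already exists in the default namespace (the replica bounds are illustrative):

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="web-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"
        ),
        min_replicas=2,    # floor during quiet periods
        max_replicas=20,   # ceiling during traffic spikes
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```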

Observability

Being able to closely track how your application is performing has many benefits. It notifies you (or the appropriate developer) if the application is experiencing a problem. It enables you to get to the root of the problem quickly and maintains an audit trail of changes useful for troubleshooting and often required for compliance. Integrating tools like Grafana and Prometheus allows you to visualize and analyze traffic patterns and load characteristics.
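Exposing metrics for Prometheus to scrape can be as simple as instrumenting your request handler. A minimal sketch using the prometheus_client library (the metric names and simulated work are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@LATENCY.time()           # records how long each call takes
def handle_request():
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```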

Backup and Disaster Recovery

Despite your best efforts, your application will likely go down at some point, and it’s up to you to establish systems and procedures for bringing it back up. In many organizations, these plans are codified in Service Level Objectives and Recovery Time Objectives, which document how long it should take you to restore service if something crashes. Usually, restoring service requires at least some human intervention, but backups should happen automatically, and at the very least, interruptions in service should trigger notifications to the appropriate person.
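Automating the backup half is usually straightforward. For example, a nightly RDS snapshot via boto3 might look like the sketch below (the instance identifier is hypothetical; you would schedule this with cron or EventBridge):

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Snapshot an illustrative database instance, tagged with today's date.
rds.create_db_snapshot(
    DBInstanceIdentifier="app-db",
    DBSnapshotIdentifier=f"app-db-nightly-{stamp}",
)
```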

Key Considerations Before Automating

Nothing is ever black and white in the software development lifecycle, and every organization solves the DevOps problem slightly differently depending on their requirements. Below are three general options for handling DevOps automation and best practices, but you can (and probably will) choose more than one.

Custom Development

Developers are do-it-yourselfers by nature, and since even the broadest and most sophisticated DevOps automation solution is just a lot of code, it’s tempting to write the code yourself. And you should write at least some of it yourself to create comprehensive DevOps processes for your team. But if you attempt to develop all the automation yourself, you’ll probably always be babysitting scripts to one degree or another.

Rather, save the personal touch for the areas unique to your application and critical to your customer’s success.

Point Solutions

Solutions like Jenkins for CI/CD, FluentBit for observability, PagerDuty for alerts, and Terraform for composing infrastructure help to automate parts of the DevOps lifecycle. In their own way, each solves a whole problem. But DevOps is made up of a dozen discrete problem domains, and if you automate one domain, it may create work in another domain. You may, for instance, create a Kubernetes cluster using Terraform, specifying the Kubernetes version, but not know that the cloud provider on which Kubernetes is running plans to end-of-life that version. Upgrading to a newer version can create a cascading series of non-automatable tasks required to sort out what other dependencies within your DevOps infrastructure must also be upgraded.

It is, in other words, hard to do only one thing within DevOps. Each change affects many other parts of the system, whether automated or not. Finding a DevOps automation point solution for each segment of the life cycle doesn’t mean the lifecycle is completely automated.

Platforms

Application Platforms like Heroku or Google App Engine automate broad swaths of your DevOps infrastructure, integrating many discrete tasks into a single unified process. While no platform currently on the market handles DevOps from beginning to end, employing a platform for a broad cross-section of the lifecycle frees up your time and energy for the segments the platform doesn’t cover.

However, be aware that most platforms that automate DevOps are a one-way street. If you use Heroku, for instance, your life will likely be easier, but it will be harder to use many of the latest DevOps tools and cloud services from AWS, GCP, and Azure. Using most application platforms is a wager that what you lose by turning away from the big cloud providers, you’ll make up for in efficiency by increasing DevOps automation across a range of tasks.

What You Shouldn’t Automate

You can’t automate everything in DevOps, nor should you. And even in the areas you can automate, you will need to make intelligent choices about how automation functions. You can set up automatic alerts, for instance, but you can’t alert everyone about everything. You’ll have to think deeply about which events are most important and who needs to know about them. A few well-thought-out notifications may be much more effective than dozens of alerts broadcast to everyone.
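One way to encode that thinking is an explicit severity-to-channel routing table, so only customer-impacting events page a human. A toy sketch (the severities and channels are illustrative):

```python
# Route alerts by severity: page a human only for customer-impacting
# events; everything else goes to a channel or the log.
ROUTES = {
    "critical": "pagerduty",   # wakes the on-call engineer
    "warning": "slack",        # reviewed during working hours
    "info": "log",             # recorded, never broadcast
}

def route(event: dict) -> str:
    return ROUTES.get(event.get("severity", "info"), "log")

assert route({"severity": "critical", "msg": "checkout 5xx rate > 5%"}) == "pagerduty"
assert route({"msg": "cache miss ratio elevated"}) == "log"
```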

Similarly, you should be very careful about the access levels you grant developers. You don’t want a developer automatically granted access to user data just because of a hyperactive setup script. And many companies have been shocked by the costs incurred when “autoscaling” runs amok. You must thoroughly understand the implications of what you’re automating; otherwise, it’s likely to create more work than it saves.

The Result of Automation

One of the stories we tell ourselves as engineers is that one day we will run out of things to do. Perhaps, if we become too good at automating with specialized tools, we’ll turn into button pushers who no longer get to work on interesting problems.

I believe that these fears are an illusion. Automating routine tasks not only opens up time and space for solving problems of a higher order, but also gives you leave to dive deep into the gory details of critical components of the DevOps processes and lifecycle that might otherwise receive short shrift. You might be able to do everything, but you can’t do it all at once. To focus on high-value DevOps improvements, you will need at least a portion of your infrastructure and configuration management running on autopilot.

If you can automate a significant portion of DevOps practices, these are a few of the areas you can turn your attention to with your newly-freed-up time:

Backup and Restore Procedures

If the system you manage experiences a catastrophic failure, chances are you’re already beyond the point where automation alone can bring you back online. Because most DevOps engineers are fighting fires in other areas, they never take the time to plan thoroughly for how a failure will be handled, or to document and test the steps for bringing everything back as quickly as possible. Conducting the fire drill before the fire takes place is the only way to ensure that your DevOps team can respond sensibly when and if the worst case occurs.

Dealing with Data

Having time to think should allow you to improve how your DevOps infrastructure deals with data. You can research to ensure that your system addresses issues of data sovereignty with precision – ensuring that application engineers aren’t able to access data beyond their privilege levels and that data never passes into an unintended jurisdiction without anyone knowing about it. 

You may also be able to purge data you don’t need (or that is a liability to keep) and archive terabytes of old data to a less expensive storage tier like S3 Glacier or Google Cloud Storage’s archive class. Attention to your data assets can also help you develop faster by populating development and staging environments with a sufficient volume of realistic, anonymized data that emulates production conditions. With 200 dummy records in your development database, you may never encounter the kinds of bugs and problems you would with 200 million records, each with more true-to-life variability.
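Libraries like Faker make generating that kind of realistic-but-fake data trivial. A minimal sketch (the record shape is illustrative):

```python
from faker import Faker  # third-party: pip install faker

fake = Faker()
Faker.seed(42)  # deterministic output, so every environment gets the same data

# Generate realistic-but-fake user records for staging.
rows = [
    {"id": i, "name": fake.name(), "email": fake.email(), "city": fake.city()}
    for i in range(200)
]
print(rows[0])
```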

Improved Test Coverage

A large application may have millions of lines of code. Most operations teams do not have test coverage for all or most of this code, and many do not even know the coverage ratio. Automating DevOps practices enables you to take the time to measure the degree of coverage and develop strategies for how to close the gap.
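Once measured, coverage can also be enforced automatically. For instance, with the pytest-cov plugin you can fail the CI build whenever coverage slips below a floor (the package name and threshold are illustrative):

```python
import subprocess

# Run the suite with coverage and fail the build below an 80% floor.
# Requires the pytest-cov plugin, which provides --cov-fail-under.
result = subprocess.run(["pytest", "--cov=app", "--cov-fail-under=80"])
raise SystemExit(result.returncode)
```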

Development Operator Procedures

In a development team of any size, the team’s productivity depends as much on standard operating procedures and established communication practices as on individual output. Building development runbooks, migration instructions, documented procedures, and a thorough changelog allows developers to come and go without tribal knowledge being lost. Nor do you want your organization’s data to remain accessible to someone after they have left, so another standard procedure must be the revocation of privileges to all the internal systems that the developer touched.

These tasks often fall by the wayside because of everything else the DevOps engineer is responsible for. However, it’s these and other higher-value tasks, enabled by automation, that allow the DevOps function to evolve from the engineering team’s plumbers (called out to fix leaks and flush code from one place to another) into architectural partners who work alongside application engineers to build infrastructure that speeds up feature development, improves security, saves money, and delights customers.

A Platform for DevOps that Leaves Your Options Open

When we developed Control Plane – a platform for running microservices – we addressed the problem of DevOps automation comprehensively. Control Plane automates approximately 50% of the DevOps lifecycle and infrastructure, allowing you to run workloads on multiple regions of multiple clouds and mix and match cloud services in minutes. Crucially, we also designed Control Plane to enable you to integrate with the tools and procedures you are already using, whether you want to add your own Kubernetes cluster running in a private cloud as a Control Plane location, consume services inside a VPC, or add your own observability tooling like Datadog or Logz.io. We provide the benefits of a comprehensive DevOps platform without the lock-in and limitations.


Frequently Asked Questions

Why do companies need automated DevOps?

Automated DevOps allows developers to improve and update their software faster and more efficiently. When DevOps processes are automated, developers can focus on other areas of work while trusting the reliability of automated, easily scalable systems. In addition, automation helps developers catch bugs quickly and manage their infrastructure effectively.

What is an automated DevOps pipeline?

An automated DevOps pipeline is a set of processes and specialized tools that enable developers to build and deploy code. Depending on the project’s type and scale, the pipeline can vary in size and complexity. For developers looking to automate DevOps, it helps to understand the pipeline’s steps in order to seek out opportunities for automation.

Will DevOps get automated?

Automation has been a foundational part of DevOps since its beginnings. Data from the 2021 State of DevOps report showed that up to 97% of companies that use advanced DevOps agree that automation is a key part of improving the quality of their work.