Sam Learns Azure

Splitting up our ARM Templates with Azure CLI to improve our infrastructure deployment reliability

Today we are going to implement a new strategy to manage our ARM templates with Azure DevOps. We’ve been using ARM templates for several years, and they have been great as infrastructure as code. Over the last few years, our project has continued to grow in complexity, to a 1000+ line nested ARM template. However, as it’s grown, we’ve had a few bugs that have affected our deployment reliability and have been difficult to track down (we suspect the issue is a race condition between two resources). Additionally, testing this large ARM template can take anywhere from 20 to 60 minutes. In short, we needed a solution to make our infrastructure as code more reliable and maintainable. We have two goals today: first, to make our deployments reliable, and second, to deploy in 20 minutes or less.

Before we continue, I want to thank my colleague Ken Skvarcius for originally sharing the concept and this solution.

The solution

We will create smaller ARM templates for each resource, and use Azure CLI to deploy the templates. Let’s look at how this new solution addresses (and doesn’t address) our problems. First, advantages:

What about disadvantages? Is this a perfect solution?

Overall, we think we are in a better place. Controlling complexity is everything in this modern world, and having simple templates, with an easy way to test them, is a real improvement.

Implementing the solution

We started by refactoring all of our ARM templates into individual files. Let’s walk through an example, the Redis deployment. This is the code to deploy Redis, and yes, it’s only 19 lines! (That excludes the 50 lines of parameters, but most of those are just default value ranges.)

Next we need to write the Azure CLI script that deploys the ARM template. We create a new Redis PowerShell file, add the parameters we need (lines 1-10), set up the variables (lines 12-15), and then deploy with “az deployment group create”, passing in parameters for the resource group, template file, and template parameters. That’s it! We do have some relative complexity on lines 23 to 27, where we extract the Redis connection string and upload it to our key vault, but that is still only 4 lines. To test the Redis deployment, we only need to pass the 7 parameters to the PowerShell file.
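The shape of such a deployment script can be sketched as follows. Our real script is a PowerShell file; this is only a minimal Python sketch that builds the Azure CLI argument lists without running them. The resource group, vault, secret, file, and parameter names (`redisName`, `skuName`) are hypothetical placeholders, not the values from our project.

```python
from typing import Dict, List


def build_deploy_command(resource_group: str, template_file: str,
                         parameters: Dict[str, str]) -> List[str]:
    """Build the argument list for `az deployment group create`."""
    cmd = [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
    ]
    if parameters:
        # Template parameters are passed as space-separated key=value pairs.
        cmd.append("--parameters")
        cmd.extend(f"{key}={value}" for key, value in parameters.items())
    return cmd


def build_secret_command(vault_name: str, secret_name: str, value: str) -> List[str]:
    """Build the argument list for storing a value as a key vault secret."""
    return [
        "az", "keyvault", "secret", "set",
        "--vault-name", vault_name,
        "--name", secret_name,
        "--value", value,
    ]


# Hypothetical values, for illustration only.
deploy_cmd = build_deploy_command(
    resource_group="my-resource-group",
    template_file="Redis.json",
    parameters={"redisName": "my-redis", "skuName": "Basic"},
)
print(" ".join(deploy_cmd))
```

Keeping the command construction separate from execution makes it easy to inspect exactly what would be sent to Azure CLI. On a machine where Azure CLI is installed and logged in, the list could then be handed to something like `subprocess.run`.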

Finally, we integrated this into our Azure Pipelines YAML deployment pipeline template. The snapshot below shows that we decided to deploy Redis in its own job (more about this in the next section). The only dependency for this job (line 109) is a core infrastructure job (containing a key vault and storage). We download the build artifacts on lines 116-122, and then on lines 123-129, run Azure CLI with the 7 correct parameters to run our PowerShell script.

Initial testing shows what we’ve already seen on other projects: the templates deploy quickly, efficiently, and reliably.

Using more parallel jobs

Now that we have a reliable deployment, let’s look at our secondary goal to deploy in 20 minutes or less. Let’s look at what our current/old situation is, as a comparison.

1. Deploying the old solution to a new resource group: This needs 60-80 minutes, as we need to run the deployment twice for all of our resources to deploy. The required double deployment (and our failure to resolve it) is what brought about this post.
[Screenshots: old solution, deployment #1; old solution, deployment #2]

2. Deploying the old solution over an existing deployment: This runs in 10-15 minutes. Good numbers, but that is expected, given this isn’t really doing much apart from correcting some configuration drift.

3. Deploying the new solution to a new resource group: Runs in ~60 minutes – no worse than the old solution, but it runs reliably, successfully, every time. However, in the time it takes to deploy our infrastructure, it’s running all of the tasks in serial… but do they need to be in serial? We could in theory, run our Redis, CDN, and SQL server deployments in parallel, as they don’t depend on each other. Let’s try that next.

4. Deploying the new solution to a new resource group, with multiple jobs: Runs in 30-35 minutes, about 30 minutes faster than the serial solution, with 10 parallel jobs (you may remember we used parallel jobs a few months ago with amazing results). Not quite 20 minutes, but as we mentioned earlier, it’s reliable, which is more important, and this is just the initial deployment to a new region. (How quickly can you deploy to a new region/resource group?)

5. Deploying the new solution over an existing deployment, with multiple jobs: Repeated runs over existing resources take 10-15 minutes, essentially equivalent to the old solution, but with more reliability.

Overall, our new solution is roughly 60% faster than the old solution, and most importantly, our reliability numbers are great!
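The difference between the serial and parallel runs can be reasoned about with a small model: a serial run costs the sum of all job durations, while a parallel run costs only the longest chain of dependent jobs. The sketch below is a Python illustration under stated assumptions. The durations, the `web` job, and the dependency graph are made up for the example and are not measurements from our pipeline, and it assumes there are enough parallel agents for every ready job to start immediately.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical job durations (minutes) and dependencies, for illustration only.
JOBS: Dict[str, Tuple[int, List[str]]] = {
    "core": (5, []),                 # key vault and storage
    "redis": (20, ["core"]),
    "cdn": (10, ["core"]),
    "sql": (15, ["core"]),
    "web": (10, ["redis", "sql"]),   # hypothetical downstream job
}


def serial_minutes(jobs: Dict[str, Tuple[int, List[str]]]) -> int:
    """Total time when every job runs one after another."""
    return sum(duration for duration, _ in jobs.values())


def parallel_minutes(jobs: Dict[str, Tuple[int, List[str]]]) -> int:
    """Total time when independent jobs run at once (the critical path).

    Assumes the dependency graph has no cycles.
    """
    @lru_cache(maxsize=None)
    def finish(name: str) -> int:
        duration, deps = jobs[name]
        # A job starts only after its slowest dependency has finished.
        return duration + max((finish(dep) for dep in deps), default=0)

    return max(finish(name) for name in jobs)


print(serial_minutes(JOBS), parallel_minutes(JOBS))
```

In this toy graph the total is bounded by the `core`, `redis`, `web` chain, which is why adding more parallel jobs stops helping once the longest chain of dependencies dominates.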

Why didn’t we use linked templates?

When we started this project, we initially looked at linked templates as an option. There are similarities to this solution, but linked templates add their own complexity: the child templates have to be hosted in storage and secured with SAS tokens. Essentially we would be trading the complexity of a single nested ARM template for complexity in security and setup, and this still doesn’t make our parent templates any easier to test.

Wrap-up

Overall, this new process has been reliable and has given us some excellent results. It was a complicated change as we worked to not break our existing builds, and as such, the pull request was pretty massive, with 108 commits over 2 months (and another dozen in two follow-up PRs as we resolved some minor bugs over the next few days). It was all worth it, and it’s changed the way we are going to manage ARM templates going forward.

There are still some good possible next steps: can we check Azure to see if we need to deploy the ARM template at all? We will find out. Until next time!
