Splitting up our ARM Templates with Azure CLI to improve our infrastructure deployment reliability

Today we are going to implement a new strategy to manage our ARM templates with Azure DevOps. We’ve been using ARM templates for several years, and they have been great as infrastructure as code. Over the last few years, our project has continued to grow in complexity, to a 1000+ line nested ARM template. As it has grown, however, we’ve had a few bugs that have affected our deployment reliability and have been difficult to track down (we suspect the issue is a race condition between two resources). Additionally, testing this large ARM template can take anywhere from 20 to 60 minutes. In short, we needed a solution to make our infrastructure as code more reliable and maintainable. We have two goals today:

  • Primary goal: Improve reliability of our infrastructure as code
  • Secondary goal: Keep deployment time under 20 minutes. This is secondary because reliability is the most important metric; speed is a bonus.

Before we continue, I want to thank my colleague Ken Skvarcius for originally sharing the concept and this solution.

The solution

We will create smaller ARM templates for each resource, and use Azure CLI to deploy the templates. Let’s look at which of our problems this new solution addresses, and which it doesn’t. First, the advantages:

  • Reliability: deploying each ARM template one at a time allows us to remove a lot of complexity and gives us confidence that we are making small changes everyone can understand. The small templates (nearly) always work and don’t have race conditions with our services being deployed.
  • Maintainability: most of our new ARM templates are 100 lines or less, with one exception (SQL) at 200 lines, mostly because it contains two resources: a SQL server and a SQL database. These smaller templates also help with testing, as it’s much easier to make a change and test just that individual template than the entirety of all 10 services. Since we are using Azure CLI to deploy individual ARM templates, we can also use PowerShell on our laptops, with Azure CLI installed, to test each service one at a time.

What about disadvantages? Is this a perfect solution?

  • Overall, we have slightly more code now. To split our large ARM template into 7 services, we need multiple ARM templates and Azure CLI steps in Azure DevOps, which in turn means we need to manage variables and state a little differently.
  • We have to manage resource dependencies ourselves. Since the templates are separate and not linked, the “dependsOn” property of the ARM templates is essentially useless across templates. When we deploy our ARM templates, we need to manage the order of resources ourselves. For example, we need to deploy Azure Storage before we deploy a CDN. Fortunately, much of this is mitigated because we do some small nesting of resources in specific services, for example the database in SQL server.
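
Managing the order ourselves amounts to a topological sort over the services. The following is a minimal sketch in Python (the post's own scripts are PowerShell), assuming a hand-maintained dependency map. Only "storage before CDN" and Redis needing the key vault come from this post; the service names and the SQL entry are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# deployed before it. "storage before CDN" and "key vault before Redis" are
# taken from the post; the SQL entry is an illustrative assumption.
DEPENDS_ON = {
    "key_vault": set(),
    "storage": set(),
    "cdn": {"storage"},
    "redis": {"key_vault"},
    "sql": {"key_vault"},
}


def deployment_order(depends_on):
    """Return one valid deployment order (dependencies first).

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # Print the services in an order that respects every dependency.
    print(deployment_order(DEPENDS_ON))
```

Any order it returns places storage before the CDN and the key vault before Redis, which is the guarantee that “dependsOn” used to provide inside a single template.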

Overall, we think we are in a better place. Controlling complexity is everything in this modern world, and having simple templates, with an easy way to test them, is where we want to be.

Implementing the solution

We started by refactoring all of our ARM templates into individual files. Let’s walk through an example, the Redis deployment. This is the code to deploy Redis – yes it’s only 19 lines! (excluding the 50 lines of parameters, but most of this is just default value ranges).

Next we need to write the Azure CLI to deploy the ARM template. We create a new Redis PowerShell file, add the parameters we need (lines 1-10), set up the variables (lines 12-15), and then deploy with “az deployment group create”, passing in parameters for the resource group, template file, and template parameters. That’s it! We do have some relative complexity on lines 23 to 27, where we extract and upload the Redis connection string to our key vault, but that is still only 4 lines. To test the Redis deployment, we only need to pass in the 7 parameters to the PowerShell file.
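
The shape of that deployment call can be sketched as follows. This is not the PowerShell script described above; it is a rough Python illustration that only builds the argument list for “az deployment group create”. The function name, resource group, file name, and parameter names are all hypothetical; the flags used (`--resource-group`, `--template-file`, `--parameters`) are standard Azure CLI options.

```python
import shlex


def build_deploy_command(resource_group, template_file, parameters):
    """Build the argument list for `az deployment group create`.

    `parameters` maps ARM template parameter names to values; each one is
    passed to the CLI as key=value after a single --parameters flag.
    """
    command = [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template_file,
    ]
    if parameters:
        command.append("--parameters")
        command.extend(f"{key}={value}" for key, value in parameters.items())
    return command


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    redis_command = build_deploy_command(
        "my-resource-group",
        "redis.json",
        {"redisCacheName": "my-redis-cache", "redisCacheCapacity": 1},
    )
    # Show the command as it would be typed in a shell; nothing is executed.
    print(shlex.join(redis_command))
```

Because the command is built as a list rather than one string, it could be handed to `subprocess.run` without shell quoting concerns, and the same function can be reused for each service template with a different file and parameter set.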

Finally, we integrated this into our Azure Pipelines YAML deployment pipeline template. The snapshot below shows that we decided to deploy Redis in its own job (more about this in the next section). The only dependency for this job (line 109) is a core infrastructure job (containing a key vault and storage). We download the build artifacts on lines 116-122, and then, on lines 123-129, run Azure CLI with the 7 correct parameters to run our PowerShell script.

Initial testing shows what we’ve already seen on other projects: the templates deploy quickly, efficiently, and reliably.

Using more parallel jobs

Now that we have a reliable deployment, let’s look at our secondary goal to deploy in 20 minutes or less. Let’s look at what our current/old situation is, as a comparison.

1. Deploying the old solution to a new resource group: This needs 60-80 minutes, as we need to run the deployment twice for our resources to deploy. The required double deployment (and our failure to resolve it) is what brought about this post.
[Screenshots: old solution, deployment #1; old solution, deployment #2]

2. Deploying the old solution over an existing deployment: This runs in 10-15 mins. Good numbers, but given this isn’t really doing much apart from correcting some configuration drift, it is expected.

3. Deploying the new solution to a new resource group: Runs in ~60 minutes – no worse than the old solution, but it runs reliably, successfully, every time. However, in the time it takes to deploy our infrastructure, it’s running all of the tasks in serial… but do they need to be in serial? We could in theory, run our Redis, CDN, and SQL server deployments in parallel, as they don’t depend on each other. Let’s try that next.

4. Deploying the new solution to a new resource group, with multiple jobs: Runs in 30-35 minutes, about 30 minutes faster than the serial solution, with 10 parallel jobs (you may remember we used parallel jobs a few months ago with amazing results). Not quite 20 minutes, but as we mentioned earlier, it’s reliable, which is more important, and this is just the initial deployment to a new region. (How quickly can you deploy to a new region/resource group?)

5. Deploying the new solution over an existing deployment, with multiple jobs: Repeated runs over existing resources run in 10-15 minutes, essentially equivalent to the old solution, but with more reliability.
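
The difference between the serial and parallel runs can be reasoned about as a critical-path calculation: in serial, the total is the sum of every job, while with enough parallel jobs it is the longest chain of dependent jobs. Here is a minimal Python sketch of that idea. The job names and all durations are made-up illustrative values, not measurements from our pipeline, and it assumes there are always enough parallel agents available.

```python
from graphlib import TopologicalSorter

# Hypothetical job durations in minutes (illustrative, not measured).
DURATION = {"core": 5, "redis": 20, "cdn": 10, "sql": 15}

# Hypothetical job dependencies: every service job waits for the core job.
DEPENDS_ON = {
    "core": set(),
    "redis": {"core"},
    "cdn": {"core"},
    "sql": {"core"},
}


def serial_minutes(duration):
    """Total time when every job runs one after another."""
    return sum(duration.values())


def parallel_minutes(duration, depends_on):
    """Wall-clock time with unlimited parallel jobs (longest dependency chain)."""
    finish = {}
    for job in TopologicalSorter(depends_on).static_order():
        # A job starts once its slowest dependency has finished.
        start = max((finish[dep] for dep in depends_on[job]), default=0)
        finish[job] = start + duration[job]
    return max(finish.values())


if __name__ == "__main__":
    print("serial:", serial_minutes(DURATION))
    print("parallel:", parallel_minutes(DURATION, DEPENDS_ON))
```

With these made-up numbers the serial total is 50 minutes and the parallel estimate is 25 minutes, since the slowest chain (core, then Redis) sets the floor. That is also why adding more parallel jobs stops helping once the longest chain dominates.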

Overall, our new solution is roughly 60% faster than the old solution, and most importantly, our reliability numbers are great!

Why didn’t we use linked templates?

When we started this project, we initially looked at linked templates as an option. There are similarities to this solution, but linked templates add complexity around hosting the child templates in storage and securing access to them with SAS tokens. Essentially, we would have been trading the complexity of a single nested ARM template for complexity in security and setup, and this still wouldn’t make our parent templates any easier to test.

Wrap-up

Overall, this new process has been reliable and has given us some excellent results. It was a complicated change, as we worked to not break our existing builds, and as such the pull request was pretty massive, with 108 commits over 2 months (and another dozen in two follow-up PRs as we resolved some minor bugs over the next few days). It was all worth it, and it’s changed the way we are going to manage ARM templates going forward.

There are still some good possible next steps: can we check Azure to see if we need to deploy the ARM template at all? We will find out. Until next time!

11 comments

    1. Great question!

      I continue to run the same set of services, but now I’m always running them in the same order, and individually they are easier to test. Idempotency still exists in successive deployments.

  1. Was there any reason why you don’t just use a parent/child relationship with your templates, i.e. using parent ‘deployments’?

    You would break them out as you have done, then have a parent “Deploy All” template that orchestrates everything below? That way it’s a single pipeline task and easier to manage?

    That way you are still fully declarative and can still test/deploy individual templates, but you get the benefit of being parallel via dependsOn within the parent template, and you can then also turn stages on or off via feature flags/conditions?

    here is a sample: https://github.com/brwilkinson/ADF/blob/master/AZE2-ADF-RG-D01/0-azuredeploy-ALL.json of what a parent template might look like?

    Was just thinking of cost versus benefits of either methodology? Plus thinking over limitations of each method.

    1. Yes! I found that the parent/child template was longer, larger, harder to maintain, and harder to test. My old parent/child template runs in 15mins. My new templates, spread over multiple jobs, runs in 5-6mins, plus I can test each one individually.

      Additionally, if I’m deploying something like a web service multiple times, instead of having the web app twice, I have the same web app template, and I configure and run it for each instance.

    1. Great question. I found this blog that explains it better than I could:

      “It all depends on what you want to do in Azure. You can use both in Cloud Shell in PowerShell, and you can use both remotely at your workstation to manage the Azure cloud. If it’s resources in Azure that you want to manage, then use Azure CLI; and if you need to manage Windows Servers, then use Azure PowerShell.”

      Reference: https://www.msp360.com/resources/blog/azure-cli-vs-powershell/#:~:text=Azure%20CLI%20commands,-Resource%20group&text=To%20put%20it%20simply%2C%20Azure,to%20manage%20the%20Azure%20cloud.
