Wreak havoc with Azure Chaos Studio

The main goal of Azure Chaos Studio is to introduce faults and problems into your environment in order to test its resiliency. In the same way that configuring backups means we sometimes need to test restores, being confident in our setup means testing the other capabilities of the environment as well.

With Chaos Studio we can, for example:

  • Test high availability
  • Test capacity in our applications, servers and other systems

It can be a very powerful product when you design resilient environments and follow the Well-Architected Framework: introduce faults and errors in your dev and stage environments to find issues fast. Think of the DevOps fail-early or fail-fast principles, which limit the number of problems that can reach production environments.

Azure Chaos Studio supports two types of faults in your environment:

Service-direct

  • Talks to services and runs against Azure Services directly
  • Example: Shut down virtual machines inside a scale set

Agent-based

  • Agent installed inside the virtual machine
  • Example: CPU and RAM pressure spikes
  • Example: Killing Windows services and/or processes

The entire fault library can be found in the Azure Chaos Studio documentation.

Important! This is not a simulation. Chaos Studio will actually perform the effect (such as CPU spikes).

How does it work?

You create experiments with steps and branches inside Chaos Studio

  • Branches run in parallel
  • Steps run sequentially
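The ordering can be sketched in a few lines of Python. This only illustrates the execution model and is not how Chaos Studio is implemented. The step, branch and action names are made up, the short sleeps stand in for real faults, and the sketch assumes that actions inside a single branch run in order.

```python
import asyncio


async def run_branch(step, branch, actions, log):
    # Assumption: actions inside one branch run one after another.
    for action in actions:
        log.append(f"{step}/{branch}: start {action}")
        await asyncio.sleep(0.01)  # stand-in for the fault running
        log.append(f"{step}/{branch}: done {action}")


async def run_experiment(steps):
    log = []
    # Steps run sequentially: the next step starts only after every
    # branch of the current step has finished.
    for step, branches in steps.items():
        # Branches within a step run in parallel.
        await asyncio.gather(
            *(run_branch(step, name, actions, log)
              for name, actions in branches.items())
        )
        log.append(f"{step}: finished")
    return log


# Hypothetical experiment layout; all names are illustrative only.
EXPERIMENT = {
    "step-1": {
        "branch-a": ["cpu-pressure"],
        "branch-b": ["kill-process"],
    },
    "step-2": {
        "branch-a": ["vm-shutdown"],
    },
}

if __name__ == "__main__":
    for line in asyncio.run(run_experiment(EXPERIMENT)):
        print(line)
```

In the log, both branches of step-1 start before either of them finishes, and nothing from step-2 appears until step-1 is reported as finished.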

Onboard a virtual machine

  • Create a user-assigned managed identity
  • Choose which resource group to store it in and give it a name. I called mine uami-test-we-chaos
  • Search for Chaos Studio in the Azure Portal
  • Select Targets -> select your VM, then under Enable targets select Enable agent-based targets (VM, VMSS)
  • Select the managed identity you created in the previous step and select Disabled under Application Insights
  • Select Review + Enable

This will install the ChaosAgent virtual machine extension. You can verify that it is installed by checking the extensions listed on the virtual machine.

Create an experiment

  • Back inside Chaos Studio select Experiments -> +Create -> New experiment
  • Select your subscription, resource group and give your experiment a name
  • Give your step & branch a name
  • Select +Add action & Add fault
  • Under Faults select CPU pressure, then select Next, choose your virtual machine under Target resources and select Add
  • Select Review + create
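Behind the portal, an experiment is stored as a JSON document. As a rough sketch, the Python below builds what the selectors and steps for the CPU pressure experiment above might look like. The field names, the fault URN and the Microsoft-Agent target suffix reflect my understanding of the experiment format and should be checked against the current documentation. The function name, selector ID and resource ID are placeholders, and top-level fields such as location and identity are left out.

```python
import json


def cpu_pressure_properties(vm_resource_id, minutes=10, pressure_level=95):
    """Build the selectors and steps for a one-step, one-branch experiment."""
    # Assumption: agent-based targets are child resources of the VM
    # with this suffix.
    target_id = f"{vm_resource_id}/providers/Microsoft.Chaos/targets/Microsoft-Agent"
    return {
        "selectors": [
            {
                "type": "List",
                "id": "selector1",  # hypothetical selector name
                "targets": [{"type": "ChaosTarget", "id": target_id}],
            }
        ],
        "steps": [
            {
                "name": "Step 1",
                "branches": [
                    {
                        "name": "Branch 1",
                        "actions": [
                            {
                                "type": "continuous",
                                # Assumed URN for the CPU pressure fault.
                                "name": "urn:csci:microsoft:agent:cpuPressure/1.0",
                                "selectorId": "selector1",
                                "duration": f"PT{minutes}M",  # ISO 8601 duration
                                "parameters": [
                                    {"key": "pressureLevel", "value": str(pressure_level)}
                                ],
                            }
                        ],
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder resource ID; replace with your own VM.
    vm_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
    )
    print(json.dumps(cpu_pressure_properties(vm_id), indent=2))
```

Keeping the definition as data like this makes it easier to review what an experiment will do before it is run, which matters because the faults are real.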

Run your experiment (Warning: this is not a simulation, things will actually happen!)

  • First we need to make sure the user-assigned managed identity has the appropriate permissions to carry out the experiment. I gave mine the Contributor role (the documentation states that Reader should be enough, but that did not work for me; as this service is in preview, some bugs may occur occasionally)
  • We must also add the user-assigned managed identity to the virtual machine: go to your VM and select Identity in the left pane
  • Now head to Chaos Studio and start your experiment
  • Under History you can follow the status of your experiment; on the right-hand side select Details
  • I went to my virtual machine page, selected Metrics in the left pane and was able to see my experiment take effect
  • Back inside Chaos Studio I went to the details of my experiment, which reported a successful status

Conclusion

This is still in preview, so some issues exist and that is to be expected. However, I am excited for this tool, as you can focus on the scenarios you want to validate and make sure the correct things happen when "shit hits the fan", so to speak.

  • Does following the Well-Architected Framework (WAF) actually work?
  • What happens if DNS fails, how does it impact my application?

We often design for failure but don't have a great way to test that things actually work as intended when there is a fire. I recommend beginning with a small scope and expanding systematically so that you do not destroy anything in production.

This service is currently free while in preview. I suspect that when it hits GA you will most likely be charged for the time your experiments run, and faults such as CPU spikes may add cost of their own because they consume more compute resources.

Reference

What is Azure Chaos Studio Preview?
Measure, understand, and build resilience to incidents by using chaos engineering to inject faults and monitor how your application responds.

Azure Chaos Studio - John Savill
