Chaos Engineering: A Comprehensive Guide

Before defining the term "chaos engineering," it helps to understand why testing matters in the software development cycle.


This quote captures why good software test cases matter:


“No amount of testing can prove software right; a single test can prove software wrong.”— Amir Ghahrai


That’s where intelligent test cases and concepts come into the picture.


In this time of incremental releases and agile development, continuous examination of software is imperative to offer a seamless, faultless, and consistent experience to the users. At present, tech organizations are implementing modern and automated ways to test software thoroughly and completely.


Once modern infrastructure and distributed systems are taken into account, testing becomes trickier, because it is harder to examine a system from every angle. Organizations are therefore developing smarter ways to test their systems, from unit tests to functional tests to infrastructure testing.


One such concept, introduced by Netflix, is chaos engineering.


Chaos Engineering: Infrastructure Testing The Netflix Way


Chaos engineering was introduced by Netflix, one of the largest media subscription services with around 150 million paid subscriptions worldwide.


Before we understand this concept, here is a brief explanation of terms we are going to use in this blog.


In the modern application life cycle, tech companies around the globe typically use four environments to develop software.


Development Environment: Where a program is developed/coded


Test Environment: Where a copy of the product is tested carefully to verify that it performs as expected


Acceptance Test Environment: Where the client tests the system and verifies whether it meets the expectations or not


Production Environment: The live environment a product goes to after passing acceptance testing


What Is Chaos Engineering?


In the simplest terms, chaos engineering is a disciplined approach to identifying weaknesses in systems by experimenting on them in the production environment.


It is used to check the system’s reliability, stability, and ability to survive unstable and unexpected conditions.


Large-scale distributed systems can fail in many ways, including application failure, network failure, infrastructure failure, dependency failure, and so on.


Moreover, modern systems are built from many small components (microservices) and deployed on cloud architecture, which makes them more prone to failures and outages.


The Need For Chaos Engineering


Here are some points that justify chaos engineering:


It improves the resilience of the system


It reveals the weaknesses of the system


It is proactive in nature, as opposed to the reactive nature of traditional testing


It exposes hidden threats and minimizes the risks


Difference Between Chaos Engineering And Testing


Traditional testing feeds a system known inputs and compares the results against predicted outputs to confirm the desired behavior. Its scope is limited: it does not generate new knowledge about how the system will behave when something unexpected goes wrong.


Chaos engineering runs broad, carefully controlled experiments whose outcomes are not known in advance, and so generates new knowledge about the system’s behavior, properties, and performance. Its scope is wider, and it can combine conditions that nobody planned for in order to observe the system closely.


The space of possible chaos experiments is effectively open-ended, which creates more opportunities to examine the system from different angles. You can create chaos intentionally to check whether the system can withstand it.


A Brief History Of Chaos Engineering


It all started as large-scale distributed systems grew in popularity. Testing the resilience of a system in a distributed environment was difficult. Here, resilience means not only the system’s ability to withstand failures but also its ability to keep delivering the expected quality while doing so.


Enter Netflix


By 2011, Netflix was well into its move from physical data centers to the cloud, a migration it had begun in 2008 to provide users with a better video streaming experience. The Netflix Engineering Tools team came up with an innovative idea to test the fault tolerance of the system without any impact on customer service.


They created the Chaos Monkey tool, inspired by the idea of a monkey that gets into a server farm and randomly destroys things.


From the Netflix Technology Blog:


Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.


Compared with a physical data center, Netflix expected its cloud environment to experience more component failures, since it is abstracted and distributed in nature.


The reason behind running the Chaos Monkey tool in the Netflix system is simple:


The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. - Netflix Tech Blog
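The idea in the two quotes above can be illustrated with a toy simulation. This is only a sketch, not Netflix's Chaos Monkey: `InstancePool`, `unleash_monkey`, and the instance names are hypothetical, and the "instances" are just strings in a set. It shows one randomly chosen instance being terminated while the redundant pool keeps serving requests.

```python
import random


class InstancePool:
    """Toy stand-in for a group of redundant service instances (hypothetical)."""

    def __init__(self, names):
        self.healthy = set(names)

    def terminate(self, name):
        """Simulate an instance going away."""
        self.healthy.discard(name)

    def handle_request(self):
        """Serve from any healthy instance; fail only if none remain."""
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        return sorted(self.healthy)[0]


def unleash_monkey(pool, rng):
    """Terminate one randomly chosen healthy instance and return its name."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim


pool = InstancePool(["web-1", "web-2", "web-3"])
rng = random.Random(42)  # seeded so that a run is repeatable
victim = unleash_monkey(pool, rng)
served_by = pool.handle_request()
print(f"terminated {victim}; request still served by {served_by}")
```

With three redundant instances, losing any one of them leaves the pool able to answer. With a pool of one, the same experiment would raise an error, which is exactly the kind of weakness such a tool is meant to expose.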


After the success of the Chaos Monkey tool, the Netflix team created a suite of tools that supports chaos engineering principles, named the Simian Army, to check the reliability and resiliency of its AWS infrastructure.


List Of Tools Developed By Netflix:


Chaos Monkey


Latency Monkey


Doctor Monkey


Conformity Monkey


Janitor Monkey


Security Monkey


Chaos Gorilla


10–18 Monkey


Together, these tools continuously exercise the system against many kinds of failures and abnormal conditions, building a higher level of confidence in the system’s ability to survive.


How Can You Use Chaos Engineering In Your System?


You might say, “We are not Netflix and we don’t have any large-scale system and huge customer base like Netflix.”


That’s true. But, over time, chaos engineering has evolved and is no longer limited to one organization or to digital companies like Netflix. Many companies with large customer bases are dedicated to offering a seamless experience to their users. To ensure consistent performance and constant availability, healthcare, educational, and finance organizations are also running chaos experiments.


Four Basic Steps To Perform Chaos Engineering


Chaos experiments in distributed systems use two groups: an experimental group, which receives the injected failure, and a control group, which is left untouched so that the two can be compared.


Define a steady state that represents the normal behavior of the system


Hypothesize what the expected outcome will be when something goes wrong


Design experiments with variables that reflect real-world events, such as dependency failure, server failure, or network or memory malfunction


Measure the impact of the experiment by comparing the steady state of the two groups


If the engineering team finds a weakness in the system, the chaos experiment is a success. Otherwise, the team widens the boundaries of its hypothesis and tries again.


When weaknesses are found, the team addresses and fixes those issues before they become system-wide troubles.
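The four steps above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not a real chaos tool: the service is faked by `call_service`, the 99% baseline success rate, the 5% tolerance, and all function names are hypothetical, and the "steady state" is simply the request success rate.

```python
import random


def call_service(rng, dependency_up, has_fallback):
    """Simulated request; returns True on success.

    Assumed baseline: 99% of requests succeed. When the dependency is down,
    a request can only succeed if the service has a fallback path.
    """
    if not dependency_up and not has_fallback:
        return False
    return rng.random() < 0.99


def success_rate(rng, requests, dependency_up, has_fallback):
    """Steady-state metric: the fraction of requests that succeed."""
    ok = sum(call_service(rng, dependency_up, has_fallback) for _ in range(requests))
    return ok / requests


def run_experiment(has_fallback, requests=1000, tolerance=0.05, seed=7):
    rng = random.Random(seed)
    # Step 1: measure the steady state on the untouched control group.
    control = success_rate(rng, requests, True, has_fallback)
    # Step 3: inject a real-world event (dependency failure) into the
    # experimental group only.
    experimental = success_rate(rng, requests, False, has_fallback)
    # Steps 2 and 4: the hypothesis is that the steady state holds, i.e. the
    # two groups differ by no more than the tolerance.
    hypothesis_holds = abs(control - experimental) <= tolerance
    return control, experimental, hypothesis_holds


for has_fallback in (True, False):
    control, experimental, holds = run_experiment(has_fallback)
    verdict = "steady state held" if holds else "weakness found"
    print(f"fallback={has_fallback}: control={control:.3f} "
          f"experimental={experimental:.3f} -> {verdict}")
```

In this sketch, the service with a fallback keeps its steady state during the injected failure, while the one without a fallback drops to zero successful requests, so the experiment reports a weakness to fix.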


Note: Because chaos experiments run in production or in an environment close to it, customer experience might be affected. It is always wise to start with the smallest possible experiments and be ready to handle the impact.


Ultimate Goal Of Chaos Engineering: Discover the “What-If” Scenario


A distributed system tends to have more failure points because of its complexity and scale.


Chaos engineering tries to discover those failure points and identify what will happen when a resource or component becomes unavailable.
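A "what-if" question of this kind can be made concrete with a small sketch. Everything here is hypothetical and chosen for illustration: `RecommendationClient` stands in for some downstream dependency, `home_page` for the code that uses it, and the fault is injected by flipping a flag rather than by a real chaos tool.

```python
class DependencyUnavailable(Exception):
    """Raised when the simulated dependency cannot be reached."""


class RecommendationClient:
    """Hypothetical downstream dependency with a switch for fault injection."""

    def __init__(self):
        self.available = True

    def fetch(self, user_id):
        if not self.available:
            raise DependencyUnavailable("recommendation service is down")
        return [f"personalized-item-for-{user_id}"]


def home_page(client, user_id):
    """Build a page, falling back to a static list if the dependency is down."""
    try:
        items = client.fetch(user_id)
        degraded = False
    except DependencyUnavailable:
        items = ["popular-item"]
        degraded = True
    return {"items": items, "degraded": degraded}


client = RecommendationClient()
normal = home_page(client, "u1")

client.available = False  # what if the dependency disappears?
what_if = home_page(client, "u1")

print(normal)
print(what_if)
```

The experiment answers the question directly: with the dependency gone, the page still renders in a degraded mode instead of failing. If `home_page` had no fallback, the same injected fault would surface as an unhandled error.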


This is a very suitable practice in modern software development approaches like DevOps and microservices architectures.


Today, not just Netflix but many large organizations use it to check that a system can withstand breakdowns, and then fix the issues that the chaos experiments reveal.


Companies That Are Using Chaos Tools:


Facebook


Google


Microsoft


Amazon


Twilio


LinkedIn


Chaos Engineering And DevOps: Better Understand Your System Amidst Frequent Releases


DevOps is all about continuous improvement and frequent releases.


Chaos principles are a strong fit for testing a system’s ability to withstand failures in DevOps-driven software development. When system architects and testers are under pressure to release quickly, chaos engineering helps uncover unknown conditions in distributed, continuously changing, and complex systems.


We have seen drastic changes in software development frameworks and methods in the last few years. Monolithic architectures have largely been replaced by cloud and microservice architectures in order to build software at high velocity.


Here too, chaos engineering works well, since it can identify dependency failures and failures at the points where services connect, both of which are common in a microservice architecture.


Chaos Engineering: More Than A Preventive Mechanism


"Failure is a success if we learn from it." -Malcolm Forbes


This quote makes even more sense once you understand the idea behind chaos principles. You need to learn from failure to improve your system, to make it more resilient, and to increase confidence in the system’s capabilities.


There are many tools available for chaos engineering, and many organizations are experimenting with different techniques and tools to make it a more mature and useful approach. By intentionally creating chaos in the system, an organization can achieve long-term software resiliency. Resiliency and quality are important factors for distributed systems with fast release cycles.

