Chaos engineering initially took off as a means of understanding distributed software systems. It popularised the concept of proactive failure testing to build better systems and gave rise to the concept of “failure as a service”, or “resilience as a service”. Today, Site Reliability Engineers are familiar with the basics of proactive failure testing as a means to create better systems.
Chaos Engineering is an attempt to find weaknesses before they surface as system-wide, aberrant behaviour. Such weaknesses can show up as a service becoming unavailable, as outages when a downstream dependency receives too much traffic, or as cascading failures when a single point of failure crashes.
Chaos Engineering was popularised by engineers at Netflix, whose primary goal was to build stable, secure and bug-free software. To achieve this, Netflix engineers introduced a tool called Chaos Monkey, built on the principles of a model that can be applied across different projects and departments to make IT services more resilient. The Netflix infrastructure is built on this methodology, which allows engineers to significantly improve their systems without compromising on the complexity of the system. Chaos Engineering also promotes flexibility and rapid development.
Building Resilient Systems
Netflix engineers published a document, Principles of Chaos Engineering, which detailed how an empirical, systems-based approach helped systems withstand outages. It also helped engineers understand the complex behaviour of distributed systems and observe them in a controlled environment. Over time, the team used these learnings to strengthen the microservices infrastructure and developed the concept of Chaos Engineering.
Citing an example, the team revealed how Chaos Engineering was put into effect for one of their services, subscriber, which handles user management activities and authentication. Due to unforeseen situations such as problems with downstream services, the subscriber service may misbehave. To minimise the fallout, the team devised a strategy to improve resiliency around the product so that customers can rely on it even during downtime. To reduce latency from traffic, the team observed deviations in two groups:
- Control group
- Variable group
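The comparison between the two groups can be illustrated with a minimal sketch. It is not Netflix's implementation: the latency figures are simulated, and the function names, the 30 ms injected delay and the 10 ms tolerance are all assumptions chosen for illustration. The idea is simply that the control group runs untouched, the variable group has a fault injected, and the experiment checks whether the difference stays within an agreed tolerance.

```python
import random
import statistics


def simulate_latencies(n, base_ms, injected_ms=0.0, seed=0):
    """Return n simulated request latencies in milliseconds (hypothetical data)."""
    rng = random.Random(seed)
    return [rng.gauss(base_ms, 5.0) + injected_ms for _ in range(n)]


def deviation(control, variable):
    """Difference in mean latency between the variable and control groups."""
    return statistics.mean(variable) - statistics.mean(control)


def within_tolerance(control, variable, tolerance_ms):
    """True if the variable group stays within tolerance_ms of the control group."""
    return abs(deviation(control, variable)) <= tolerance_ms


# Control group: normal traffic. Variable group: same traffic plus injected delay.
control = simulate_latencies(1000, base_ms=100.0, seed=1)
variable = simulate_latencies(1000, base_ms=100.0, injected_ms=30.0, seed=2)

print(f"deviation: {deviation(control, variable):.1f} ms")
print("steady state held:", within_tolerance(control, variable, tolerance_ms=10.0))
```

In a real experiment the two lists would come from production metrics rather than a random number generator, and the tolerance would be set from the service's own steady-state behaviour.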
Why Companies Are Relying On Chaos Engineering
The rise of microservices and cloud architectures has led to increased complexity in infrastructure, with systems becoming prone to outages and failures that lead to revenue losses. According to Gremlin, which provides a framework for Failure-as-a-Service, even brief outages can heavily impact bottom-line revenue, so reducing the cost of downtime is becoming a KPI for many engineering teams. Gremlin noted that in 2017, 98% of organizations said a single hour of downtime would cost their business over $100,000. The company cited the example of British Airways, which suffered an outage in May 2017 that left thousands of passengers stranded and cost the company around $102.19 million.
This California-headquartered firm offers a full suite of enterprise failure testing solutions so that engineers can find out how resilient their production system is. Distributed systems are more complex than monolithic systems, and it is harder to predict when and how they will fail. Some of the most common false assumptions about distributed systems are that the network is reliable, that latency is zero and that the network is homogeneous.
Here are some of the top reasons for deploying Failure-as-a-Service:
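To make the point about network reliability and latency concrete, here is a minimal fault-injection sketch. It does not use Gremlin's product or API: `with_chaos`, `call_with_retry` and `fetch_profile` are hypothetical names, and the failure rate and delay are illustrative values. The wrapper delays each call and makes some calls fail, so that the calling code's retry logic can be exercised.

```python
import random
import time


class InjectedNetworkError(Exception):
    """Raised by the wrapper to imitate an unreliable network."""


def with_chaos(func, failure_rate=0.0, latency_s=0.0, rng=None):
    """Wrap func so each call may be delayed or fail (parameters are illustrative)."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            time.sleep(latency_s)  # real networks never have zero latency
        if rng.random() < failure_rate:
            raise InjectedNetworkError("injected failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3):
    """Retry a flaky call a fixed number of times before giving up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedNetworkError as err:
            last_error = err
    raise last_error


def fetch_profile():
    """Hypothetical downstream call standing in for a real service."""
    return {"user": "demo"}


flaky = with_chaos(fetch_profile, failure_rate=0.5, latency_s=0.01,
                   rng=random.Random(42))
try:
    print(call_with_retry(flaky, attempts=5))
except InjectedNetworkError:
    print("call failed after retries")
```

Wrapping the call rather than editing the service keeps the injected fault easy to switch off, which matters when an experiment has to be stopped quickly.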
- Chaos Engineering is billed as a disciplined approach to investigate failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
- This novel principle also allows engineers to compare what they think will happen with what actually happens in their systems. You literally “break things on purpose” to learn how to build more resilient systems.
- Interestingly, many big tech firms that run distributed systems and microservices architectures rely on Chaos Engineering. These include LinkedIn, Netflix, Facebook, Amazon, Google and Microsoft, among others.
- Chaos Engineering is about running a series of planned experiments that enable engineers to learn how their systems behave in the face of downtime and outages.
- The main technical reason for deploying Chaos Engineering is that insight from these simulated experiments can lead to a better understanding of the underlying architecture. It can also greatly reduce on-call burden, improve understanding of system failure modes and reduce maintenance costs.
- One of the most popular experiments Gremlin provides is dubbed Unknown-Unknowns, wherein one shuts down an entire cluster in the main region. The team shuts down two replicas of the cluster at the same time and, after gathering the mean time over a couple of months, determines how to clone two new replicas off the primary cluster.
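The replica experiment in the last point can be sketched as a toy simulation. This is not a Gremlin API: the `Cluster` class, the 60 to 180 second clone time and the eight runs are all assumptions, and the sketch assumes the quantity being gathered is the mean time to restore both replicas from the primary.

```python
import random
import statistics


class Cluster:
    """Toy model of one primary plus replicas (hypothetical, not a real API)."""

    def __init__(self, replicas=2):
        self.target = replicas
        self.replicas = replicas

    def shut_down_replicas(self, count):
        """Remove up to `count` replicas at once."""
        self.replicas = max(0, self.replicas - count)

    def clone_from_primary(self, rng):
        """Clone one replica and return the simulated time taken in seconds."""
        self.replicas += 1
        return rng.uniform(60.0, 180.0)


def run_experiment(cluster, rng, kill=2):
    """Shut down `kill` replicas at the same time and time the recovery."""
    cluster.shut_down_replicas(kill)
    elapsed = 0.0
    while cluster.replicas < cluster.target:
        elapsed += cluster.clone_from_primary(rng)
    return elapsed


# Repeat the experiment several times and average the recovery time.
rng = random.Random(7)
recovery_times = [run_experiment(Cluster(replicas=2), rng) for _ in range(8)]
print(f"mean time to restore both replicas: {statistics.mean(recovery_times):.0f} s")
```

Against a real cluster, each run would be a scheduled experiment and the timings would come from monitoring data collected over the weeks the text describes.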