
What Is Chaos Engineering: A Deep Dive Into How Engineers Build Resilient Systems


Chaos engineering initially took off as a means of understanding distributed software systems. It popularised the concept of proactive failure testing to build better systems and birthed the concept of “failure as a service”, or “resilience as a service”. Today, Site Reliability Engineers are familiar with the basics of proactive failure testing as a means to create better systems.

Chaos Engineering is an attempt to find weaknesses before they surface as system-wide, aberrant behaviour. Such weaknesses have consequences: a service becomes unavailable, an outage occurs because a downstream dependency receives too much traffic, or failures cascade when a single point of failure crashes.

Chaos Engineering was popularised by engineers at Netflix, whose primary goal was to build stable, secure and bug-free software. To achieve this, Netflix engineers introduced a tool called Chaos Monkey, built on principles that can be applied across different projects and departments to make IT services more resilient. The Netflix infrastructure is built on this methodology, which allows engineers to significantly improve their systems without compromising the complexity of the system. Chaos Engineering also accelerates flexibility and rapid development.

Building Resilient Systems

Netflix engineers published a document, Principles of Chaos Engineering, which detailed how an empirical, systems-based approach helped systems withstand outages. It also helped engineers understand the complex behaviour of distributed systems and observe those systems in a controlled environment. Over time, the team used these learnings to strengthen the microservices infrastructure and developed the concept of Chaos Engineering.

Citing an example, the team revealed how Chaos Engineering was put into effect for one of their services, subscriber, which handles user management activities and authentication. Due to unforeseen situations such as problems in downstream services, the subscriber service may go out of control. To minimise the fallout, the team devised a strategy to improve resiliency around the product so that customers can rely on it even during downtime. To reduce latency from traffic, the team observed deviations between two groups:

  1. Control group
  2. Variable group
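
The comparison between a control group and a variable group can be sketched as a small Python simulation. This is a minimal illustration under stated assumptions, not Netflix's implementation: the function names (`handle_request`, `success_rate`, `run_experiment`), the injected failure rate, the fallback success rate and the tolerance are all hypothetical values chosen for the example. Requests in the variable group have a simulated downstream failure injected, and the experiment reports how far that group's success rate deviates from the control group's.

```python
import random


def handle_request(rng, inject_failure, failure_rate, fallback_success):
    """Return True if the simulated request is served (possibly degraded)."""
    if inject_failure and rng.random() < failure_rate:
        # The downstream dependency failed: rely on the fallback path,
        # which itself only works some of the time.
        return rng.random() < fallback_success
    return True


def success_rate(rng, n_requests, inject_failure,
                 failure_rate=0.3, fallback_success=0.9):
    """Fraction of n_requests that were served."""
    served = sum(
        handle_request(rng, inject_failure, failure_rate, fallback_success)
        for _ in range(n_requests)
    )
    return served / n_requests


def run_experiment(seed=0, n_requests=10_000, tolerance=0.05, **kwargs):
    """Compare a control group with a variable group that has failures injected."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    control = success_rate(rng, n_requests, inject_failure=False, **kwargs)
    variable = success_rate(rng, n_requests, inject_failure=True, **kwargs)
    deviation = control - variable
    return {
        "control": control,
        "variable": variable,
        "deviation": deviation,
        "within_tolerance": deviation <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment())
```

If the deviation stays within the chosen tolerance, the fallback is doing its job; a larger deviation points to a weakness worth fixing before it shows up as a real outage.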

Why Companies Are Relying On Chaos Engineering

The rise of microservices and cloud architectures has led to increased complexity in infrastructure, with systems becoming prone to outages and failures that lead to revenue losses. According to Gremlin, which provides a framework for Failure-as-a-Service, even brief outages can heavily impact the bottom line, so reducing the cost of downtime is becoming a KPI for many engineering teams. Gremlin indicated that in 2017, 98% of organizations said a single hour of downtime would cost their businesses over $100,000. The company cited the example of British Airways, which suffered an outage in May 2017 that left thousands of passengers stranded and cost the company around $102.19 million.

This California-headquartered firm offers a full suite of enterprise failure testing solutions so that engineers can find out how resilient their production system is. Distributed systems are more complex than monolithic systems, and it is harder to predict when and how they will fail. Much of the difficulty comes from common false assumptions about distributed systems: that the network is reliable, that latency is zero and that the network is homogeneous.

Here are some of the top reasons for deploying Failure-as-a-Service:

  • Chaos Engineering is billed as a disciplined approach to investigating failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.
  • This novel principle also allows engineers to compare what they think will happen with what actually happens in their systems. You literally “break things on purpose” to learn how to build more resilient systems.
  • Interestingly, many big tech firms that boast distributed systems and microservices architectures rely on Chaos Engineering. These include LinkedIn, Netflix, Facebook, Amazon, Google and Microsoft, among others.
  • Chaos Engineering is about running a series of planned experiments that enable engineers to learn how systems behave in the face of downtime and outages.
  • The main technical reason for deploying Chaos Engineering is that insight from these simulated experiments can lead to a better understanding of the underlying architecture. It can also greatly reduce on-call burden, improve understanding of system failure modes and reduce maintenance costs.
  • One of the most popular experiments Gremlin provides is dubbed Unknown-Unknowns, wherein one shuts down an entire cluster in the main region. In a related experiment, the team shuts down two replicas of the cluster at the same time and, after gathering the mean time over a couple of months, determines how to clone two new replicas off the primary cluster.
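
The replica experiment in the last bullet can be sketched as a toy simulation in Python. Everything here is hypothetical and for illustration only: the `Cluster` class, the assumed 5 to 15 minute clone time and the number of trials stand in for real infrastructure and do not represent Gremlin's API. The sketch shuts down two replicas, clones replacements off the primary, and averages the recovery time over repeated trials so that it can be compared with what the team predicted.

```python
import random
import statistics


class Cluster:
    """Toy cluster: one primary plus some replicas (hypothetical model)."""

    def __init__(self, replicas, rng):
        self.replicas = replicas
        self.rng = rng

    def shut_down_replicas(self, count):
        """Simulate the injected failure; return how many replicas were lost."""
        count = min(count, self.replicas)
        self.replicas -= count
        return count

    def clone_replica(self):
        """Clone one replica off the primary; return simulated minutes taken."""
        self.replicas += 1
        return self.rng.uniform(5.0, 15.0)  # assumed range, illustration only


def run_trial(cluster, to_kill=2):
    """Shut down replicas, restore them, and return total recovery minutes."""
    lost = cluster.shut_down_replicas(to_kill)
    return sum(cluster.clone_replica() for _ in range(lost))


def mean_recovery_time(trials=8, seed=0):
    """Repeat the experiment and return (mean minutes, final replica count)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    cluster = Cluster(replicas=3, rng=rng)
    times = [run_trial(cluster) for _ in range(trials)]
    return statistics.mean(times), cluster.replicas


if __name__ == "__main__":
    predicted_minutes = 25.0  # the team's hypothesis, also an assumed value
    observed, replicas = mean_recovery_time()
    print(f"predicted {predicted_minutes:.1f} min, observed {observed:.1f} min, "
          f"{replicas} replicas healthy")
```

In a real experiment the timings would come from monitoring rather than a random number generator, but the shape is the same: state a prediction, inject the failure, measure, and compare.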

 


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years of experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.