Chaos engineering: An introduction

Kumar Shivam
3 min read, Jun 14, 2020

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.

Introduction

With the rise of distributed computing and microservices, software systems are changing the game for software engineering. To meet industry needs, we are adopting practices that increase flexibility and agility in development, as well as the velocity of integration and deployment. Even so, when we adopt these approaches, one question always stays on our minds.

How confident are we about putting our system into production?

Gaining confidence in a production system requires proactive testing to analyse how it responds under stress or turbulent conditions, so that we can identify and fix failures before they end up causing an outage.

Chaos Engineering is a relatively new term for the wider application of these techniques. By running experiments on distributed systems in production, we can build confidence and trust that those systems work as expected under turbulent conditions. Put simply, it means “breaking things on purpose” to learn how to build more resilient systems.
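The idea of running an experiment and checking that the system still works as expected can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the article does not prescribe an experiment format, so the names run_experiment, make_service and ExperimentResult, the toy replica model and the 5% tolerance are all hypothetical. It runs in memory, not against a real production system.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ExperimentResult:
    baseline_success_rate: float
    chaos_success_rate: float
    hypothesis_held: bool


def success_rate(call: Callable[[], bool], attempts: int) -> float:
    """Return the fraction of attempts for which the call reported success."""
    return sum(1 for _ in range(attempts) if call()) / attempts


def run_experiment(
    healthy_call: Callable[[], bool],
    chaotic_call: Callable[[], bool],
    attempts: int = 200,
    tolerance: float = 0.05,  # assumed acceptable drop in success rate
) -> ExperimentResult:
    """Compare the success rate with and without the injected fault."""
    baseline = success_rate(healthy_call, attempts)
    under_chaos = success_rate(chaotic_call, attempts)
    held = (baseline - under_chaos) <= tolerance
    return ExperimentResult(baseline, under_chaos, held)


def make_service(
    replicas_up: Sequence[bool], failover: bool, seed: int = 42
) -> Callable[[], bool]:
    """Toy service (hypothetical): a request succeeds if it reaches a live replica."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    def call() -> bool:
        first = rng.randrange(len(replicas_up))
        if replicas_up[first]:
            return True
        # With failover, the request is retried on any replica that is still up.
        return failover and any(replicas_up)

    return call


if __name__ == "__main__":
    healthy = make_service([True, True, True], failover=True)
    one_replica_down = make_service([True, False, True], failover=True)
    print(run_experiment(healthy, one_replica_down))
```

In this toy model, taking one replica down leaves the success rate unchanged when failover is enabled. Without failover the rate drops well below the tolerance, and that is the kind of weakness an experiment is meant to expose before a real outage does.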

Chaos engineering can be used to achieve resilience against:

· Infrastructure failures: The cloud is all about redundancy and fault tolerance. Since no single component can guarantee 100% uptime, resilience depends on our architecture. Digital solutions involve interdependencies because of the diversity of platforms that exist in a given organization, and this complexity becomes harder to plan for as infrastructure spreads across more environments. Chaos Monkey is a tool developed by Netflix to test the resilience of infrastructure.

· Network failures: In a distributed system, services or APIs can be crippled under rare or even unknown conditions, for example because of a network error.

· Application failures: An application can fail at any point in time, and for many reasons, such as:

1. API failure

2. DB failure

3. Caching failure, etc.
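The network and application failures listed above can be imitated in a test by wrapping a dependency call so that it sometimes responds slowly or fails. The sketch below is a minimal illustration, not the API of any tool named in this article. The names with_faults, InjectedFault and read_user are hypothetical, and the cache-then-database fallback is an assumed design used only to show what the experiment checks.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Stands in for a real network or application error."""


def with_faults(
    func: Callable[[], T],
    error_rate: float,
    delay_seconds: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap func so that calls are delayed and fail with the given probability."""
    rng = rng or random.Random()

    def wrapped() -> T:
        if delay_seconds > 0:
            sleep(delay_seconds)  # simulated network latency
        if rng.random() < error_rate:
            raise InjectedFault("simulated dependency failure")
        return func()

    return wrapped


def read_user(cache_get: Callable[[], str], db_get: Callable[[], str]) -> str:
    """Read from the cache first and fall back to the database if the cache fails."""
    try:
        return cache_get()
    except InjectedFault:
        return db_get()


if __name__ == "__main__":
    # error_rate=1.0 makes the cache fail on every call.
    broken_cache = with_faults(lambda: "alice (cache)", error_rate=1.0)
    print(read_user(broken_cache, lambda: "alice (db)"))
```

Passing sleep as a parameter lets a test record the injected delay instead of actually waiting, which keeps such experiments fast when they run outside production.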

Tools for Chaos Engineering

· Chaos Monkey: Chaos Monkey is a resiliency tool that randomly terminates instances, so that teams build applications that tolerate instance failures.
Repo: https://github.com/Netflix/chaosmonkey

· Simian Army: Simian Army consists of services in the cloud for generating various kinds of failures, detecting abnormal conditions, and testing our ability to survive them. The goal is to keep the cloud safe, secure, and highly available. The army includes Chaos Monkey, Janitor Monkey, and Conformity Monkey.
Repo: https://github.com/Netflix/SimianArmy

· Pumba: A chaos testing and network emulation tool for Docker
Repo: https://github.com/alexei-led/pumba

· PowerfulSeal: A powerful testing tool for Kubernetes clusters
Repo: https://github.com/bloomberg/powerfulseal

· Litmus: Litmus is a chaos engineering tool for stateful workloads on Kubernetes
Repo: https://github.com/openebs/litmus

Benefits of chaos engineering

· Helps prevent system outages.

· Reduces the need for debugging in a production environment.

· Helps in creating self-healing infrastructure.

· Reduces retry counts caused by improperly tuned timeouts.

· Helps monitor application performance under failure conditions.

· Helps identify and remove improper fallbacks.
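The point about retries and timeouts can be made concrete with a small replay. This is a minimal sketch under stated assumptions: call_with_retries is a hypothetical helper, and the latency values are made-up numbers standing in for delays injected during an experiment. It shows how a timeout that is tighter than the injected latency burns through every retry.

```python
from typing import Iterable, Tuple


def call_with_retries(
    latencies: Iterable[float], timeout: float, max_attempts: int
) -> Tuple[bool, int]:
    """Replay per-attempt latencies against a timeout and retry policy.

    Returns (succeeded, attempts_used).
    """
    attempts = 0
    for latency in latencies:
        if attempts == max_attempts:
            break
        attempts += 1
        if latency <= timeout:
            # This attempt finished before the timeout expired.
            return True, attempts
    return False, attempts


if __name__ == "__main__":
    injected = [0.8, 0.8, 0.8]  # seconds of latency per attempt (made-up values)
    print(call_with_retries(injected, timeout=0.5, max_attempts=3))
    print(call_with_retries(injected, timeout=1.0, max_attempts=3))
```

With these assumed numbers, the 0.5 s timeout uses all three attempts and still fails, while the 1.0 s timeout succeeds on the first attempt. Comparing attempt counts like this is one way an experiment can reveal a mis-tuned timeout.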

Conclusion

“The reality is that avoiding chaos engineering is equivalent to embracing crisis engineering.” The purpose of Chaos Engineering is to expose a system to disastrous conditions and analyse how well it copes, so that we can fix the bottlenecks before they cause an outage.

Kumar Shivam

Technical Consultant | Passionate about exploring new Technology | Cyber Security Enthusiast | Technical Blogger | Problem Solver