Chaos Monkey Tutorial: Features, Use Cases, How It Works

Wiz Experts Team

TL;DR: What is Chaos Monkey?

Chaos Monkey is a chaos engineering tool that improves system resilience by proactively introducing failures.

In complex cloud environments, servers can disappear without warning, leading to service disruptions. Chaos Monkey addresses this problem by randomly terminating virtual machine instances and containers in your production environment during business hours. The practice forces engineering teams to design and build resilient systems with proper redundancy and automation, turning the threat of unexpected infrastructure failure into a controlled, routine challenge.

Netflix originally created Chaos Monkey in 2011 during its migration to AWS and open-sourced it in 2012, and the tool has become foundational in the practice of chaos engineering.


At-A-Glance

  • GitHub: https://github.com/Netflix/chaosmonkey

  • License: Apache-2.0

  • Primary Language: Go

  • Stars: 16.2k ⭐

  • Last Release: January 2025

  • Topics/Tags: chaos-engineering, resiliency, netflix, spinnaker

Common use cases

1. Production Resilience Testing: You can use Chaos Monkey to continuously test and validate the resilience of your applications in a live production environment. By regularly and randomly terminating instances, you can verify that your services are fault-tolerant and can handle the loss of individual components without impacting the end user.

Terminating instances uncovers hidden dependencies, faulty failover logic, and incorrect timeout configurations that are difficult to find in pre-production testing. Teams typically start with less critical services and conservative settings, monitoring key metrics during terminations. As confidence grows, they gradually increase the frequency and expand the scope to include more business-critical applications, making resilience a constantly verified attribute of the production system.
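The fixes these experiments drive are usually small and concrete. As a rough sketch in Go (the replica endpoints, timeout, and status handling below are illustrative, not part of Chaos Monkey itself), a client that sets an explicit request timeout and fails over across redundant replicas survives the loss of any single instance:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// callWithFailover tries each replica in turn with a strict per-request
// timeout, so losing any single instance costs at most one timed-out
// request instead of a hung one. Endpoints and timeouts are illustrative.
func callWithFailover(endpoints []string) (*http.Response, error) {
    client := &http.Client{Timeout: 2 * time.Second} // never hang on a dead instance
    var lastErr error
    for _, url := range endpoints {
        resp, err := client.Get(url)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil // a healthy replica answered
        }
        if err == nil {
            resp.Body.Close()
            lastErr = fmt.Errorf("%s returned %d", url, resp.StatusCode)
        } else {
            lastErr = err
        }
    }
    return nil, fmt.Errorf("all replicas failed: %w", lastErr)
}

func main() {
    resp, err := callWithFailover([]string{
        "http://replica-a.internal/healthz", // hypothetical redundant replicas
        "http://replica-b.internal/healthz",
    })
    if err != nil {
        fmt.Println("degraded:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("served by a healthy replica:", resp.Status)
}

Running a termination while exercising a client like this quickly shows whether the timeout is tight enough to keep tail latency acceptable when one replica disappears.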

2. Disaster Recovery and Incident Response Validation: Chaos Monkey serves as a tool for drilling and validating disaster recovery (DR) protocols and incident response playbooks. Each instance termination acts as a small, controlled disaster, providing a live-fire exercise for on-call engineers and automated recovery systems. The tool forces teams to test their monitoring and alerting setups, ensuring that the right alarms are triggered promptly.

Chaos Monkey also verifies that automated healing mechanisms, such as instance auto-recovery or load balancer health checks, function as expected. By coordinating these experiments with response teams, you can measure time to detection (TTD) and time to recovery (TTR); identify gaps in procedures; and provide practical, low-stakes training for handling real incidents.
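One way to make those measurements concrete is a small harness run alongside the drill. The sketch below is a hypothetical example, not part of Chaos Monkey: it polls a health endpoint (the URL and intervals are placeholders) and reports when the failure is first observed and when service returns:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// measureRecovery polls a health endpoint during a termination drill and
// reports time to detection (first failed probe, relative to the start of
// the drill) and time to recovery (first healthy probe after the failure).
func measureRecovery(url string, window time.Duration) {
    client := &http.Client{Timeout: 2 * time.Second}
    start := time.Now()
    var detected time.Time

    for deadline := start.Add(window); time.Now().Before(deadline); {
        resp, err := client.Get(url)
        healthy := err == nil && resp.StatusCode == http.StatusOK
        if err == nil {
            resp.Body.Close()
        }
        switch {
        case !healthy && detected.IsZero():
            detected = time.Now()
            fmt.Printf("TTD: failure observed %v after drill start\n", detected.Sub(start))
        case healthy && !detected.IsZero():
            fmt.Printf("TTR: recovered %v after detection\n", time.Since(detected))
            return
        }
        time.Sleep(5 * time.Second)
    }
    fmt.Println("drill window ended without observing recovery")
}

func main() {
    // Start this just before Chaos Monkey's scheduled window for the service.
    measureRecovery("http://myservice.internal/healthz", 30*time.Minute)
}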

3. Driving Cultural and Architectural Transformation: Beyond its technical function, Chaos Monkey is a tool for cultural change. Making random instance failure a routine event in production fundamentally shifts the developer mindset from designing for uptime to designing for failure. The tool serves as a forcing function, compelling teams to build stateless services, implement proper health checks, and eliminate single points of failure from the outset.

Organizations often accelerate this shift by making a "passing grade" from Chaos Monkey a mandatory gate for production deployment. The result is a shared responsibility model for reliability, where post-termination reviews become collaborative learning sessions, driving a cycle of continuous architectural improvement.

4. Integration into CI/CD Pipelines: To shift resilience testing earlier in the development lifecycle, you can integrate Chaos Monkey's principles directly into CI/CD pipelines. Integration involves running automated chaos experiments in a staging or pre-production environment as a mandatory step before promoting a new release. After a service is deployed to this environment, an automated script triggers a targeted termination of one of its instances.

The pipeline then monitors key performance indicators; if error rates spike or latency exceeds a defined threshold, the pipeline fails, preventing the potentially fragile release from reaching production. Using this workflow ensures that every new code change is validated not only for functional correctness but also for its impact on the system's overall resilience, creating an effective resilience regression test.
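A sketch of such a pipeline gate in Go, assuming the chaosmonkey CLI is installed on the build agent; the application name, account, metrics endpoint, and 1% threshold are all placeholders for illustration:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
    "os/exec"
    "time"
)

func main() {
    // Ask Spinnaker (via the chaosmonkey CLI) to kill one eligible staging
    // instance; "myapp" and "staging" are placeholder app/account names.
    if out, err := exec.Command("chaosmonkey", "terminate", "myapp", "staging").CombinedOutput(); err != nil {
        fmt.Printf("termination failed: %v\n%s", err, out)
        os.Exit(1)
    }

    // Watch a hypothetical metrics endpoint for five minutes after the kill.
    client := &http.Client{Timeout: 5 * time.Second}
    for deadline := time.Now().Add(5 * time.Minute); time.Now().Before(deadline); {
        resp, err := client.Get("http://metrics.internal/myapp/error_rate")
        if err == nil {
            var m struct {
                ErrorRate float64 `json:"error_rate"`
            }
            if json.NewDecoder(resp.Body).Decode(&m) == nil && m.ErrorRate > 0.01 {
                resp.Body.Close()
                fmt.Printf("error rate %.2f%% exceeded threshold; failing build\n", m.ErrorRate*100)
                os.Exit(1) // a non-zero exit fails the pipeline stage
            }
            resp.Body.Close()
        }
        time.Sleep(15 * time.Second)
    }
    fmt.Println("service absorbed the termination; promoting release")
}

A non-zero exit code is what most CI systems treat as a failed stage, so wiring this in as a post-deploy step is usually enough to block promotion.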

5. Validating Auto-Scaling and Healing Mechanisms: Modern cloud-native applications rely on automated scaling and self-healing to maintain performance and availability. Chaos Monkey is an effective tool for rigorously validating that these mechanisms work as designed: terminating an instance within an auto-scaling group triggers a real-world test of the entire recovery loop.

Teams can observe whether the group correctly identifies the instance loss, whether it launches a replacement in a timely manner, and how long it takes for the new instance to become healthy and start serving traffic. Using Chaos Monkey this way helps fine-tune auto-scaling policies, health check configurations, and instance warm-up procedures, exposing issues like slow bootstrap times or misconfigured health checks that could lead to service degradation.
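As an illustration of that observation loop, here is a sketch using the AWS SDK for Go (v1); the auto-scaling group name is a placeholder, and it assumes AWS credentials are already configured in the environment:

package main

import (
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

// countInService returns how many instances in the group are both healthy
// and in service, so recovery time after a kill can be measured.
func countInService(svc *autoscaling.AutoScaling, name string) (int, error) {
    out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
        AutoScalingGroupNames: []*string{aws.String(name)},
    })
    if err != nil || len(out.AutoScalingGroups) == 0 {
        return 0, fmt.Errorf("describe failed: %v", err)
    }
    n := 0
    for _, i := range out.AutoScalingGroups[0].Instances {
        if aws.StringValue(i.HealthStatus) == "Healthy" &&
            aws.StringValue(i.LifecycleState) == "InService" {
            n++
        }
    }
    return n, nil
}

func main() {
    svc := autoscaling.New(session.Must(session.NewSession()))
    const group = "my-asg"                // placeholder group name
    want, _ := countInService(svc, group) // capacity before the termination

    start := time.Now() // run this right after Chaos Monkey fires
    for {
        got, err := countInService(svc, group)
        if err == nil && got >= want {
            fmt.Printf("capacity restored in %v\n", time.Since(start))
            return
        }
        time.Sleep(10 * time.Second)
    }
}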

How does Chaos Monkey work?

Chaos Monkey operates as a command-line tool triggered by a daily cron job. It begins by generating a randomized schedule of instance terminations for the day, storing this plan in a MySQL database. For each planned termination, the tool establishes a specific cron job. When a termination job executes, Chaos Monkey queries Spinnaker for the application's current state, validates potential targets against configured constraints, and finally instructs Spinnaker to terminate a random, eligible instance, such as an AWS EC2 or GCE VM.

Key operations include:

  • Daily Schedule Generation: A primary cron job runs the Go binary once a day to create a randomized termination plan based on mean time between failures and grouping rules (see the sample cron entry after this list).

  • Persistent State Management: A MySQL database acts as the system's memory, tracking planned terminations, recording execution history, and enforcing safety constraints.

  • Spinnaker Integration: Chaos Monkey relies on Spinnaker's APIs as its execution engine to fetch real-time deployment data and perform the actual instance terminations across different cloud providers.
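Per the official docs, the daily run is driven by a crontab entry along these lines (the path, user, and hour below are illustrative; pick a time shortly before your termination window opens):

# Generate the day's termination schedule every weekday morning
0 12 * * 1-5 root /apps/chaosmonkey/chaosmonkey schedule

The schedule subcommand computes the day's kills and registers a one-off job for each, which later invokes the actual termination through Spinnaker.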

Core Capabilities:

1. Randomized Instance Termination: The foundational capability of Chaos Monkey is its ability to simulate real-world infrastructure failures by randomly selecting and terminating production instances within a defined scope. The process is not entirely chaotic; it operates within configurable parameters to ensure experiments remain controlled. The tool can target specific applications, instance groups, or cloud accounts, allowing teams to isolate tests to particular services. 

By scheduling these terminations during normal business hours, it ensures that engineering teams are available to observe system behavior, respond to any unexpected degradation, and capture valuable learnings from the event. Proactive failure injection forces services to be built with resilience as a primary concern, validating that redundancy, failover mechanisms, and self-healing capabilities function as expected under stress. 

The unpredictability of which specific instance will be terminated prevents teams from building brittle, hard-coded workarounds and instead encourages robust, fault-tolerant architectures that can withstand the loss of individual compute resources, whether virtual machines or containers. By balancing randomness with control, Chaos Monkey provides a safe yet effective way to uncover hidden weaknesses in distributed systems before they manifest as customer-facing outages, acting as a constant, automated validation of a system's resilience hypothesis.

2. Deep Integration with Spinnaker: Chaos Monkey's effectiveness and operational scalability come from its integration with Spinnaker, the open-source, multi-cloud continuous delivery platform. The integration positions Spinnaker as the central control plane for all chaos engineering activities, providing a unified interface for configuration, execution, and observability. Through the Spinnaker UI, you can define which applications are opted-in for chaos testing, configure schedules, set termination policies, and establish safety constraints. 

Chaos Monkey leverages Spinnaker's understanding of the application topology and infrastructure to discover targetable instances across various cloud providers and environments. The tool uses Spinnaker's APIs to gather metadata about applications, clusters, and server groups, ensuring that terminations are context-aware and precise. Spinnaker also orchestrates the execution of the termination command, abstracting away the underlying cloud-specific APIs. The tight coupling simplifies the management of chaos experiments and enables powerful features like multi-cloud support and the ability to link chaos events directly to deployment pipelines. By embedding chaos engineering within the continuous delivery platform, you can treat resilience testing as an integral part of the software development lifecycle rather than a separate activity.
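For a feel of what that integration looks like at the API level, here is a rough sketch in Go of fetching application metadata from Spinnaker's Gate service, the same kind of query Chaos Monkey performs before selecting a target. The Gate URL and application name are placeholders, and response shapes vary by Spinnaker version:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Gate is Spinnaker's API gateway; Chaos Monkey's spinnaker.endpoint
    // setting points at it. The URL and application name are placeholders.
    resp, err := http.Get("http://spinnaker-gate.internal:8084/applications/myapp")
    if err != nil {
        fmt.Println("gate unreachable:", err)
        return
    }
    defer resp.Body.Close()

    var app map[string]interface{}
    if err := json.NewDecoder(resp.Body).Decode(&app); err != nil {
        fmt.Println("decode failed:", err)
        return
    }
    // The application's attributes include its Chaos Monkey opt-in settings.
    fmt.Printf("application metadata: %v\n", app["attributes"])
}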

3. Schedule-Based and Controlled Execution: To balance the need for unpredictable failure simulation with business operations, Chaos Monkey uses a schedule-based execution model. Rather than running continuously or at random times, the tool operates within a configurable daily window, typically aligned with standard business hours. 

A cron job generates a randomized termination schedule for the upcoming day, ensuring that while the exact timing of a failure is unknown, the overall period of potential disruption is predictable. This "predictable chaos" model is critical to adoption: it allows on-call engineers and development teams to be prepared and available to observe, learn from, and respond to any issues that arise, transforming potential incidents into real-time learning opportunities.

The scheduling system is highly configurable, allowing you to define specific start and end times, select which days of the week the tool should be active, and respect designated maintenance windows or holiday periods. This level of control ensures that chaos experiments do not interfere with critical business events or planned system upgrades, fostering a safe environment for experimentation.
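In Chaos Monkey's own TOML configuration these windows map to a handful of top-level settings. A minimal sketch, using key names from the official docs (the values are examples only):

[chaosmonkey]
enabled = true
schedule_enabled = true           # let the daily cron run build a kill schedule
start_hour = 9                    # earliest local hour a termination may fire
end_hour = 15                     # latest local hour
time_zone = "America/Los_Angeles" # the team's business-hours time zone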

4. Robust Constraint and Safety Controls: A core design principle of Chaos Monkey is to conduct experiments safely, preventing a controlled test from escalating into a widespread outage. Chaos Monkey achieves this through a multi-layered system of constraint and safety controls that act as guardrails. You and your application owners can configure these safeguards to define the “blast radius” of any potential failure. 

Key controls include setting a minimum time interval between consecutive terminations for a given application, preventing a rapid series of failures that could overwhelm recovery mechanisms. Another crucial constraint is the ability to limit the number or percentage of instances that can be terminated within a specific group or cluster, ensuring that the service maintains sufficient capacity to handle user traffic. Furthermore, the system provides exclusion capabilities, allowing teams to explicitly mark certain critical applications or sensitive environments as exempt from termination. 

These controls enable a progressive approach to chaos engineering, where teams start with very conservative settings and gradually increase the intensity of testing as they build more resilient systems and gain confidence.
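In practice these guardrails are set per application through Spinnaker. A rough sketch of the shape of that per-app configuration (field names follow Spinnaker's Chaos Monkey application settings; the exact structure may vary by version):

{
    "enabled": true,
    "meanTimeBetweenKillsInWorkDays": 5,
    "minTimeBetweenKillsInWorkDays": 1,
    "grouping": "cluster",
    "regionsAreIndependent": true,
    "exceptions": [
        { "account": "prod", "region": "eu-west-1", "stack": "*", "detail": "*" }
    ]
}

Here minTimeBetweenKillsInWorkDays enforces the minimum interval between terminations, grouping controls whether one termination is scheduled per app, stack, or cluster, and exceptions exempt the listed account/region combinations entirely.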

5. Multi-Cloud and Multi-Platform Support: Chaos Monkey addresses complex IT landscapes by offering broad support for multi-cloud and multi-platform environments, a capability primarily delivered through its integration with Spinnaker. Chaos Monkey can target resources regardless of where they are running, such as Amazon Web Services (AWS) EC2 instances, Google Cloud Platform (GCP) Compute Engine VMs, Azure Compute, Kubernetes, or CloudFoundry. (As long as Spinnaker manages the container orchestration platform, Chaos Monkey can treat container instances as first-class targets for termination.) 

Simply put, the tool ensures that organizations can apply consistent and standardized resilience testing practices across their entire technology stack, from legacy applications on VMs to modern microservices in containers.


Getting Started:

Step 1: Ensure you have Go installed on your system. For installation instructions, see https://golang.org/doc/install.

Step 2: Install Chaos Monkey by running:

go get github.com/netflix/chaosmonkey/cmd/chaosmonkey
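Note that on newer versions of Go (1.17 and later), go get no longer builds and installs binaries; the equivalent is:

go install github.com/netflix/chaosmonkey/cmd/chaosmonkey@latest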

Step 3: Confirm the installation by checking the binary in your GOPATH's bin directory. Add it to your PATH if necessary.

Step 4: To run Chaos Monkey, ensure your applications are managed by Spinnaker and that Chaos Monkey is configured for your environment.

Step 5: For further configuration and deployment steps, refer to the official docs at https://netflix.github.io/chaosmonkey.
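For orientation, the configuration lives in a TOML file (chaosmonkey.toml). A minimal sketch with placeholder values, using key names from the docs, follows; note that leashed mode logs which instance would have been terminated without actually killing it, which is the safe way to validate a new deployment:

[chaosmonkey]
enabled = true
leashed = true         # log intended kills without executing them
accounts = ["staging"] # Spinnaker accounts eligible for chaos

[database]
host = "db.example.internal" # MySQL instance that stores the schedule
port = 3306
user = "chaosmonkey"
name = "chaosmonkey"

[spinnaker]
endpoint = "http://spinnaker-gate.internal:8084"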

Verified Chaos Monkey User Reviews

Reddit

  • "Netflix's Chaos Monkey is 'a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact,' Netflix explained. 'The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption.'" - [idreamofpikas] - /r/todayilearned

  • "Basically, if you can survive failures that you create, you can better handle failures you didn't create. Conversely, if you don't know if you can handle failures you create, how do you know if you can handle random ones. It's kind of like a fire drill for computers." - AlienMushroom - /r/todayilearned

Alternatives

Feature | Chaos Monkey | LitmusChaos | Chaos Toolkit
Primary Target | VMs and containers in Spinnaker | Kubernetes | Platform-agnostic (via extensions)
Experiment Definition | Configuration in Spinnaker | Declarative YAML files (ChaosExperiment CRDs) | Declarative JSON/YAML files
Execution Model | Cron-scheduled, business-hours terminations | On-demand or GitOps-driven | CLI-driven or via automation pipelines
Extensibility | Limited, primarily through Spinnaker | Highly extensible with custom probes and experiments | Highly extensible through a driver and extension model

FAQs