Kubernetes incident response: A security playbook

Wiz Expert Team
Key takeaways
  • Define an incident response plan early and establish clear roles between security, platform, and development teams

  • Deploy comprehensive monitoring and logging across control plane, nodes, and workloads before incidents occur

  • Kubernetes incidents differ from traditional infrastructure due to ephemeral workloads and dynamic container lifecycles

  • Focus on rapid containment and preserving forensic evidence from short-lived containers

  • Implement automated response capabilities to match the speed of cloud-native attacks

Understanding Kubernetes security incidents

Kubernetes security incidents differ fundamentally from traditional IT breaches. Containers and pods are ephemeral: some live for only seconds or minutes, and they are continuously created, destroyed, and rescheduled. That churn makes attacks far harder to track than on static servers.

Common Kubernetes security incidents include:

  • Container escapes: Attackers break out of isolated containers to access the host system

  • Exposed API servers: Misconfigured authentication or overly permissive RBAC enables unauthorized access and potentially cluster-wide control

  • Compromised service accounts: Used to move laterally and access sensitive resources

  • Supply chain attacks: Malicious code hidden in seemingly legitimate container images gets deployed across your infrastructure

Incident detection and initial assessment

Effective detection starts with proper logging and Kubernetes monitoring. This is critical since median detection time exceeds 40 minutes for production incidents. Enable audit logging on your API server and configure an audit policy defining which requests to record (stages, users, resources, verbs).
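
A starting-point audit policy might look like the following sketch; the resource lists and levels are illustrative and should be tuned to your environment:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the noisy RequestReceived stage to cut log volume
omitStages:
  - "RequestReceived"
rules:
  # Record who touched secrets and configmaps without logging their contents
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record full request and response bodies for RBAC changes
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: metadata only
  - level: Metadata
```

Pass the file to the API server with --audit-policy-file, alongside --audit-log-path.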

Real-time threat detection requires tools that analyze data as it arrives. Falco monitors system calls and alerts on unusual activity. Set up log aggregation to collect data from multiple sources and correlate events across your environment.

Graph-based context for faster triage: Modern detection systems correlate audit logs, runtime signals, identity permissions, and network exposure into a unified security graph. This connects related entities—linking a suspicious pod to its service account, RBAC role, accessible secrets, and external IPs. Graph-based correlation can significantly reduce false positives—often by as much as 70–80%—by distinguishing isolated anomalies from genuine attack paths.

Multi-Cloud Log Source Mapping

| Component | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Cloud API calls | CloudTrail | Activity Logs | Cloud Audit Logs |
| Managed K8s control plane | EKS control plane logs | AKS diagnostics logs | GKE audit logs |
| Network flows | VPC Flow Logs | NSG Flow Logs | VPC Flow Logs |
| Identity/IAM | CloudTrail IAM events | Azure AD logs | Cloud IAM audit logs |
| Load balancer | ALB/NLB access logs | Application Gateway logs | Cloud Load Balancing logs |
| DNS queries | Route 53 query logs | Azure DNS analytics | Cloud DNS logs |

Node and workload logs (provider-agnostic):

  • Kubelet logs: /var/log/kubelet.log or journalctl -u kubelet

  • Container runtime: /var/log/containerd.log or crictl logs

  • Application logs: kubectl logs or centralized via Fluentd/Fluent Bit

  • Kubernetes audit logs: /var/log/kube-apiserver-audit.log (self-managed) or managed service audit logs

Kubernetes IR First-60-Minutes Checklist:

Immediate Actions (0-15 min):

  • Cordon affected nodes (kubectl cordon)

  • Apply deny-all NetworkPolicy to compromised namespace

  • Capture node and volume snapshots

  • Collect container logs and events

  • Document initial indicators and timeline

Evidence Collection (15-30 min):

  • Export audit logs for affected timeframe

  • Dump process memory from running containers

  • Copy container writable layers

  • Capture network connections

  • Preserve pod specifications

Containment (30-45 min):

  • Rotate compromised service account tokens

  • Revoke suspicious RBAC bindings

  • Drain affected nodes after evidence capture

  • Block malicious IPs at cloud firewall level

Communication (45-60 min):

  • Notify incident commander and stakeholders

  • Update incident ticket with findings

  • Coordinate with cloud provider if needed

  • Document blast radius and affected services

RACI Matrix:

  • Responsible: On-call security engineer

  • Accountable: Security team lead

  • Consulted: Platform team, affected service owners

  • Informed: CISO, compliance team

Rapid Triage Commands:

Cluster-wide assessment:

# Recent events sorted by time
kubectl get events --all-namespaces --sort-by=.lastTimestamp

# All pods with node placement
kubectl get pods -A -o wide

# Current RBAC permissions audit
kubectl auth can-i --list --as=system:serviceaccount:default:suspicious-sa

Container runtime inspection:

# List running containers
crictl ps

# Inspect container details
crictl inspect <container-id>

# View container logs
crictl logs <container-id>

Node-level forensics:

# Active network connections
ss -tunap | grep <pid>

# Process tree
ps auxf | grep <process-name>

# Recent file modifications
find /var/lib/containerd -type f -mmin -60

Cloud provider snapshots:

# AWS EBS snapshot
aws ec2 create-snapshot --volume-id <volume-id> --description "IR-evidence-$(date +%Y%m%d-%H%M)"

# Azure disk snapshot
az snapshot create --resource-group <resource-group> --source <disk-name> --name ir-snapshot-$(date +%s)

# GCP persistent disk snapshot
gcloud compute disks snapshot <disk-name> --snapshot-names=ir-snapshot-$(date +%s)

Rapid containment and isolation strategies

Agentless visibility for rapid blast radius assessment: Before applying containment, identify all affected workloads. Agentless inventory tools can scan your cluster without requiring agents in every pod, quickly finding all workloads sharing the compromised image, namespace, or node. This complete view enables comprehensive NetworkPolicy application and cordons, preventing attacker pivots to overlooked workloads.

When you detect an incident, stop the attack from spreading. NetworkPolicies provide rapid containment when your CNI plugin supports them (Calico, Cilium, Weave Net). Apply a deny-all policy to compromised pods to isolate them—note that pods using hostNetwork bypass NetworkPolicy controls and require node-level firewall rules.

Immediately apply a deny-all NetworkPolicy to affected namespaces or pods, cutting off attacker communication. Use kubectl cordon to mark affected nodes as unschedulable, preventing new workloads on potentially compromised infrastructure.

Preserve forensic evidence before moving workloads: with the node cordoned, capture node snapshots, collect logs, dump memory, and copy container layers. Only after evidence collection should you use kubectl drain to evict workloads to clean nodes.
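
The cordon, collect, then drain sequence can be sketched as follows; node, namespace, and volume names are placeholders:

```shell
# 1. Stop new workloads from landing on the suspect node
kubectl cordon <node-name>

# 2. Capture evidence while workloads are still running
kubectl get pods -n <namespace> -o yaml > evidence/pods.yaml
kubectl logs <pod-name> -n <namespace> --all-containers --timestamps > evidence/logs.txt
aws ec2 create-snapshot --volume-id <volume-id> --description "IR evidence"  # or the Azure/GCP equivalent

# 3. Only then evict workloads to clean nodes
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```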

Forensic investigation in dynamic environments

Container forensics requires specialized tools for dynamic environments. For CRI-based runtimes (containerd, CRI-O), use crictl to inspect containers and read-only filesystem mounts to analyze layers without altering evidence. Tools like kube-forensics orchestrate collection across nodes, while container-diff identifies malicious modifications.

eBPF runtime telemetry for low-overhead forensics: eBPF sensors run in the kernel and capture process execution, file access, and network activity with <1% CPU overhead. Unlike traditional agents, eBPF observes system calls in real-time without modifying code. This is critical for Kubernetes forensics because containers live for seconds—eBPF captures process trees, command-line arguments, and connections before container termination.

Critical forensic steps:

  • Volume and node snapshots: Capture cloud volume snapshots (AWS EBS, Azure Managed Disks, GCP Persistent Disks), node root disk snapshots, and container writable layers before pod termination

  • Memory dumps: Preserve running processes and network connections

  • Log collection: Gather all relevant log files

  • Network analysis: Examine traffic patterns and connection attempts
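
On the node itself, the steps above can be sketched with standard tools; the containerd paths and container ID below are placeholders, so verify locations on your own hosts:

```shell
mkdir -p /tmp/evidence

# Record the container's host PID for later memory capture
crictl inspect <container-id> | jq '.info.pid' > /tmp/evidence/container-pid.txt

# Archive the writable layer before the pod is deleted
tar -czf /tmp/evidence/rootfs.tgz \
  /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/rootfs

# Snapshot network connections and the process tree
ss -tunap > /tmp/evidence/net.txt
ps auxf > /tmp/evidence/ps.txt
```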

Common Kubernetes Security Scenarios:

Cryptomining Detection and Response:

Indicators: High CPU usage (>80% sustained), outbound connections to mining pools, suspicious processes (xmrig, minerd)

Immediate containment:

# Block mining pool domains via NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-mining-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 443

Evidence collection: Process list, network connections, container image SHA, deployment source

Root cause remediation: Scan image for vulnerabilities, review RBAC, enforce resource limits, require image signing

Exposed Kubernetes Dashboard:

Indicators: Unauthenticated access, suspicious token creation, unexpected cluster-admin bindings

Immediate containment: Delete dashboard service, revoke tokens, audit RBAC changes

Evidence collection: Dashboard access logs, API audit logs, source IPs from flow logs

Root cause remediation: Redeploy with authentication, restrict to internal network, implement SSO

Compromised Service Account:

Indicators: Service account used from unexpected IPs, unusual API calls, privilege escalation attempts

Immediate containment: Delete token secrets, remove RBAC bindings, cordon nodes where SA was used

Evidence collection: Audit logs filtered by SA name, pod specifications, network flow logs

Root cause remediation: Implement least-privilege RBAC, enable workload identity, rotate tokens
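
The containment steps for a compromised service account can be sketched as follows; all names are placeholders:

```shell
# Revoke the compromised service account's credentials
kubectl delete secret <sa-token-secret> -n <namespace>          # legacy long-lived token, if any
kubectl delete rolebinding <suspicious-binding> -n <namespace>
kubectl delete clusterrolebinding <suspicious-binding>

# Find every pod still running as that service account
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.serviceAccountName == "compromised-sa")
   | "\(.metadata.namespace)/\(.metadata.name)"'

# Restart owning workloads so they pick up fresh, short-lived projected tokens
kubectl rollout restart deployment <owning-deployment> -n <namespace>
```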

Advanced threat hunting and analysis

Proactive threat hunting involves actively searching for missed compromise signs. Regularly analyze Kubernetes audit logs for suspicious API calls, particularly from service accounts behaving unusually.

Look for service accounts suddenly creating resources they don't normally need, like new roles or secrets. Anonymous access attempts signal potential reconnaissance or exploitation. Disable anonymous authentication (--anonymous-auth=false) unless required for health checks, enforce least-privilege RBAC for system:anonymous, and investigate source IPs and timing in audit logs. Unusual authentication patterns—like logins from unexpected locations or odd times—often indicate compromised credentials.
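
As a concrete hunting sketch, the query below flags service accounts creating sensitive objects in the audit log; the sample event and file path are illustrative, and jq is assumed to be available:

```shell
# Hypothetical one-line audit event standing in for /var/log/kube-apiserver-audit.log
cat > /tmp/audit-sample.json <<'EOF'
{"kind":"Event","apiVersion":"audit.k8s.io/v1","verb":"create","user":{"username":"system:serviceaccount:default:app-sa"},"objectRef":{"resource":"secrets","namespace":"default"},"requestReceivedTimestamp":"2024-05-01T12:00:00Z"}
EOF

# Flag service accounts creating secrets or RBAC bindings
jq -r 'select(.verb == "create"
        and (.user.username | startswith("system:serviceaccount:"))
        and (.objectRef.resource == "secrets" or .objectRef.resource == "rolebindings"))
      | "\(.requestReceivedTimestamp) \(.user.username) created \(.objectRef.resource) in \(.objectRef.namespace)"' \
  /tmp/audit-sample.json
# → 2024-05-01T12:00:00Z system:serviceaccount:default:app-sa created secrets in default
```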

Behavioral analysis helps establish baselines for normal activity and spot attack indicators. Monitor resource usage patterns, network flows, and user access behaviors across your cluster.

Cross-cluster incident coordination

Multiple Kubernetes clusters create unique coordination challenges. Without centralized visibility, security teams waste time switching between tools. Establish unified logging by assigning unique cluster IDs, using consistent labels (environment, team, service), and centralizing logs into a SIEM or SOAR platform. This enables cross-cluster correlation—tracking compromised service accounts across dev and prod clusters.
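
If you ship logs with Fluent Bit, for example, a record_modifier filter can stamp every record with a cluster ID before it reaches the SIEM; the cluster name and match pattern below are placeholders:

```ini
# fluent-bit.conf excerpt: tag records with cluster identity
[FILTER]
    Name     record_modifier
    Match    kube.*
    Record   cluster_id prod-us-east-1
    Record   environment production
```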

Consistent security policies across all environments are essential. Development, staging, and production clusters should use the same security controls and response procedures, making automation easier and reducing configuration errors during incidents.

Establish clear escalation paths and ensure all team members can access centralized incident management tools.

Automated response and orchestration

Manual incident response is too slow for cloud-native environments where attacks spread in seconds. Admission controllers act as API server gatekeepers, validating workloads before deployment. Pod Security Admission (PSA) enforces three security profiles (privileged, baseline, restricted) at the namespace level. Third-party controllers like OPA Gatekeeper and Kyverno add custom policy enforcement.

Policy-as-Code tools like OPA and Kyverno integrate with admission controllers to enforce custom security rules. They automatically block containers with excessive privileges or prevent unapproved image deployment. GitOps practices enable automated remediation workflows that revert malicious changes to their last secure state.
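
As an illustration, a minimal Kyverno policy in the spirit of the upstream disallow-privileged sample blocks privileged containers at admission; the policy and rule names are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: enforce
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() marks the field as optional; if present, it must equal false
              - =(securityContext):
                  =(privileged): "false"
```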

Code-to-cloud traceability for root cause remediation: Automated response systems should trace runtime incidents to their origin—the container image, IaC template, Git repository, and CI/CD pipeline that deployed the vulnerable workload. When you detect a container with excessive privileges, the system identifies the Helm chart or Terraform module that created it, the Git commit introducing the misconfiguration, and the owning team. This enables remediation tickets with full context, source template fixes, and prevention of future recurrence.

Key automation capabilities:

  • Automatic policy enforcement: Block risky deployments before production

  • Incident escalation: Route alerts to appropriate teams based on severity

  • Evidence collection: Automatically gather logs and snapshots

  • Rollback procedures: Quickly revert to known-good configurations

Recovery and post-incident activities

Recovery begins after containing the threat and eliminating attacker access. Root cause analysis is essential for understanding how the breach occurred and preventing similar incidents. Examine configuration drift between your intended infrastructure state and what was actually running.

Deployment history analysis identifies when vulnerabilities were introduced and how they went undetected. This information improves security controls and detection capabilities. Post-incident reviews should involve all relevant teams to document and share lessons learned.

The recovery process includes updating security policies, improving detection rules, and strengthening failed controls. Conduct tabletop exercises to test updated procedures and ensure team members understand their roles.

Building a proactive Kubernetes security program

Proactive security prevents incidents rather than just responding to them. Continuous vulnerability scanning and configuration assessments, combined with least privilege enforcement (dropping unnecessary Linux capabilities), image signing verification (cosign, Notary v2), and SBOM attestation help prevent exploitation—critical since only 21% disable insecure Linux capabilities. This shift-left approach catches problems early when they're easier and cheaper to fix.
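
At the workload level, that least-privilege guidance translates into a hardened securityContext; a minimal sketch with placeholder names and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                          # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.2.3   # placeholder image
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                       # drop every Linux capability by default
        seccompProfile:
          type: RuntimeDefault
```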

Baseline Security Policies for Incident Prevention:

Pod Security Admission (namespace-level):

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

The "restricted" profile blocks privileged containers, host namespaces, and insecure capabilities—preventing 80% of common container escapes.

Default-deny NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Apply to all namespaces, then explicitly allow required traffic. This limits lateral movement during incidents.

Image signature verification (Kyverno):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: enforce
  rules:
    - name: verify-signature
      match:
        resources:
          kinds:
            - Pod
      verifyImages:
        - imageReferences:
            - "*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      -----END PUBLIC KEY-----

Blocks unsigned images, preventing supply chain attacks.

Required labels for ownership:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: enforce
  rules:
    - name: check-labels
      match:
        resources:
          kinds:
            - Deployment
            - StatefulSet
      validate:
        message: "Deployments must have team and owner labels"
        pattern:
          metadata:
            labels:
              team: "?*"
              owner: "?*"

Ensures clear ownership for incident escalation and paging.

Security champions within development teams embed security practices throughout your organization. They advocate for secure coding, help colleagues understand security risks, and provide feedback about practical developer challenges.

Regular security assessments and penetration testing validate your security controls. These exercises identify defense gaps and ensure incident response procedures work under realistic conditions.

Compliance Considerations for Kubernetes IR:

SOC 2 Type II requirements:

  • CC6.1 (Logical Access): Document RBAC policies and service account usage

  • CC7.2 (System Monitoring): Implement audit logging and alerting

  • CC7.3 (Incident Response): Maintain documented IR procedures and evidence retention

  • C1.1 (Confidentiality): Encrypt sensitive data in etcd and persistent volumes

ISO 27001 Annex A controls:

  • A.12.4.1 (Event Logging): Enable comprehensive audit logging across control plane and nodes

  • A.16.1.4 (Incident Assessment): Document incident classification and escalation procedures

  • A.16.1.5 (Incident Response): Maintain IR playbooks and conduct regular tabletop exercises

  • A.16.1.7 (Evidence Collection): Preserve forensic evidence per legal and regulatory requirements

PCI DSS (for payment processing workloads):

  • Requirement 10: Log all access to cardholder data environments

  • Requirement 10.6: Review logs daily for anomalies

  • Requirement 12.10: Implement and test incident response plan quarterly

HIPAA (for healthcare workloads):

  • §164.308(a)(6): Implement security incident procedures

  • §164.312(b): Maintain audit controls and logs

  • §164.308(a)(1)(ii)(D): Conduct regular risk assessments

Practical implementation: Map IR procedures to required controls, document evidence collection and retention policies, and conduct annual compliance audits of your Kubernetes security posture.

How Wiz transforms Kubernetes incident response

Wiz Defend provides real-time Kubernetes detection and response with high-fidelity detections curated by Wiz Research, reducing blind spots in dynamic container environments. The platform prioritizes precision over volume—detections correlate multiple signals (process execution, network connections, file access, API calls) to identify genuine threats while filtering out benign anomalies that generate false positives.

The Wiz Security Graph automatically correlates runtime threats with cloud context, showing complete attack paths from compromised containers to critical assets like admin accounts or sensitive data stores. This contextual approach enables faster incident scoping and more accurate risk assessment during active investigations.

Wiz's lightweight eBPF Runtime Sensor captures forensic evidence from ephemeral containers without performance impact. The Investigation Graph visualizes complete attack timelines and blast radius automatically, reducing mean time to investigate from hours to minutes.

The platform's code-to-cloud correlation traces runtime incidents back to vulnerable source code and Infrastructure as Code templates, enabling true root cause remediation. This capability helps development teams fix underlying issues at their source, preventing the same vulnerabilities from being reintroduced in future deployments.

Request a demo to see Kubernetes incident detection, the Investigation Graph for attack path visualization, and runtime forensics with eBPF in action.
