Understanding Kubernetes security incidents
Kubernetes security incidents differ fundamentally from traditional IT breaches. Containers and pods are ephemeral: some live for only seconds or minutes before being destroyed or rescheduled, which makes attacks far harder to track than on static servers.
Common Kubernetes security incidents include:
Container escapes: Attackers break out of isolated containers to access the host system
Exposed API servers: Misconfigured authentication or overly permissive RBAC enables unauthorized access and potentially cluster-wide control
Compromised service accounts: Used to move laterally and access sensitive resources
Supply chain attacks: Malicious code hidden in seemingly legitimate container images gets deployed across your infrastructure
Incident detection and initial assessment
Effective detection starts with proper logging and Kubernetes monitoring. This is critical since median detection time exceeds 40 minutes for production incidents. Enable audit logging on your API server and configure an audit policy defining which requests to record (stages, users, resources, verbs).
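As a starting point, a minimal audit policy might log secret access and pod exec requests in detail while recording everything else at the Metadata level. The file path, omitted stage, and rule choices below are illustrative assumptions, not a prescribed configuration:

```bash
# Hedged sketch: write a minimal audit policy for a self-managed API server
# (the path must match the --audit-policy-file flag on kube-apiserver)
cat <<'EOF' > /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"   # skip the noisy pre-processing stage
rules:
  # Log who read or changed secrets, but not the secret values themselves
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Capture exec/attach requests in full: these are common attacker actions
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach"]
  # Everything else at Metadata level
  - level: Metadata
EOF
```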
Real-time threat detection requires tools that analyze data as it arrives. Falco monitors system calls and alerts on unusual activity. Set up log aggregation to collect data from multiple sources and correlate events across your environment.
Graph-based context for faster triage: Modern detection systems correlate audit logs, runtime signals, identity permissions, and network exposure into a unified security graph. This connects related entities—linking a suspicious pod to its service account, RBAC role, accessible secrets, and external IPs. Graph-based correlation can significantly reduce false positives—often by as much as 70–80%—by distinguishing isolated anomalies from genuine attack paths.
Multi-Cloud Log Source Mapping
| Component | AWS | Azure | GCP |
|---|---|---|---|
| Cloud API calls | CloudTrail | Activity Logs | Cloud Audit Logs |
| Managed K8s control plane | EKS control plane logs | AKS diagnostics logs | GKE audit logs |
| Network flows | VPC Flow Logs | NSG Flow Logs | VPC Flow Logs |
| Identity/IAM | CloudTrail IAM events | Azure AD logs | Cloud IAM audit logs |
| Load balancer | ALB/NLB access logs | Application Gateway logs | Cloud Load Balancing logs |
| DNS queries | Route 53 query logs | Azure DNS analytics | Cloud DNS logs |
Node and workload logs (provider-agnostic):
Kubelet logs: /var/log/kubelet.log or journalctl -u kubelet
Container runtime: /var/log/containerd.log or crictl logs
Application logs: kubectl logs or centralized via Fluentd/Fluent Bit
Kubernetes audit logs: /var/log/kube-apiserver-audit.log (self-managed) or managed service audit logs
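On systemd-based nodes, the node-level sources above can be preserved quickly; the output paths and one-hour window here are illustrative:

```bash
# Hedged sketch: capture recent kubelet and container runtime logs from a node
mkdir -p /ir/logs
journalctl -u kubelet --since "1 hour ago"    > /ir/logs/kubelet.log
journalctl -u containerd --since "1 hour ago" > /ir/logs/containerd.log
```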
Kubernetes IR First-60-Minutes Checklist:
Immediate Actions (0-15 min):
Cordon affected nodes (kubectl cordon)
Apply deny-all NetworkPolicy to compromised namespace
Capture node and volume snapshots
Collect container logs and events
Document initial indicators and timeline
Evidence Collection (15-30 min):
Export audit logs for affected timeframe
Dump process memory from running containers
Copy container writable layers
Capture network connections
Preserve pod specifications
Containment (30-45 min):
Rotate compromised service account tokens
Revoke suspicious RBAC bindings
Drain affected nodes after evidence capture
Block malicious IPs at cloud firewall level
Communication (45-60 min):
Notify incident commander and stakeholders
Update incident ticket with findings
Coordinate with cloud provider if needed
Document blast radius and affected services
RACI Matrix:
Responsible: On-call security engineer
Accountable: Security team lead
Consulted: Platform team, affected service owners
Informed: CISO, compliance team
Rapid Triage Commands:
Cluster-wide assessment:
```bash
# Recent events sorted by time
kubectl get events --all-namespaces --sort-by=.lastTimestamp

# All pods with node placement
kubectl get pods -A -o wide

# Current RBAC permissions audit
kubectl auth can-i --list --as=system:serviceaccount:default:suspicious-sa
```
Container runtime inspection:
```bash
# List running containers
crictl ps

# Inspect container details
crictl inspect <container-id>

# View container logs
crictl logs <container-id>
```
Node-level forensics:
```bash
# Active network connections
ss -tunap | grep <suspicious-ip-or-port>

# Process tree
ps auxf | grep <suspicious-process>

# Recent file modifications
find /var/lib/containerd -type f -mmin -60
```
Cloud provider snapshots:
```bash
# AWS EBS snapshot
aws ec2 create-snapshot --volume-id <volume-id> \
  --description "IR-evidence-$(date +%Y%m%d-%H%M)"

# Azure disk snapshot
az snapshot create --resource-group <resource-group> \
  --source <disk-name-or-id> --name ir-snapshot-$(date +%s)

# GCP persistent disk snapshot
gcloud compute disks snapshot <disk-name> \
  --snapshot-names=ir-snapshot-$(date +%s)
```
Rapid containment and isolation strategies
Agentless visibility for rapid blast radius assessment: Before applying containment, identify all affected workloads. Agentless inventory tools can scan your cluster without requiring agents in every pod, quickly finding all workloads sharing the compromised image, namespace, or node. This complete view enables comprehensive NetworkPolicy application and cordons, preventing attacker pivots to overlooked workloads.
When you detect an incident, stop the attack from spreading. NetworkPolicies provide rapid containment when your CNI plugin supports them (Calico, Cilium, Weave Net). Apply a deny-all policy to compromised pods to isolate them—note that pods using hostNetwork bypass NetworkPolicy controls and require node-level firewall rules.
Immediately apply a deny-all NetworkPolicy to affected namespaces or pods to cut off attacker communication, and use kubectl cordon to mark affected nodes as unschedulable so no new workloads land on potentially compromised infrastructure.
Preserve forensic evidence before moving workloads: capture node snapshots, collect logs, dump memory, and copy container layers. Only after evidence collection should you use kubectl drain to evict workloads to clean nodes.
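Putting those steps in sequence, a containment sketch might look like the following; the node name and namespace are placeholders, and the drain flags reflect current kubectl versions:

```bash
# Stop new scheduling on the suspect node
kubectl cordon <node-name>

# Isolate the compromised namespace with a deny-all policy
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ir-deny-all
  namespace: <compromised-namespace>
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF

# Only after snapshots, logs, and memory are captured:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```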
Forensic investigation in dynamic environments
Container forensics requires specialized tools for dynamic environments. For CRI-based runtimes (containerd, CRI-O), use crictl to inspect containers and read-only filesystem mounts to analyze layers without altering evidence. Tools like kube-forensics orchestrate collection across nodes, while container-diff identifies malicious modifications.
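A minimal collection sketch with crictl, assuming a containerd runtime with the default overlayfs snapshotter; the container ID and output directory are placeholders:

```bash
CID=<container-id>
mkdir -p /ir/evidence

crictl inspect "$CID" > /ir/evidence/inspect.json   # config, mounts, runtime spec
crictl logs "$CID"    > /ir/evidence/stdout.log     # captured stdout/stderr
crictl stats "$CID"   > /ir/evidence/stats.txt      # resource usage at capture time

# Archive writable-layer contents read-only; this path is the containerd
# default and may differ on your nodes
tar --one-file-system -czf /ir/evidence/overlay-snapshots.tgz \
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
```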
eBPF runtime telemetry for low-overhead forensics: eBPF sensors run in the kernel and capture process execution, file access, and network activity with <1% CPU overhead. Unlike traditional agents, eBPF observes system calls in real-time without modifying code. This is critical for Kubernetes forensics because containers live for seconds—eBPF captures process trees, command-line arguments, and connections before container termination.
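As a taste of kernel-level capture, this bpftrace one-liner (assuming bpftrace is installed on the node) prints every process execution as it happens, so the record survives even if the container exits moments later:

```bash
# Log each execve: parent command -> executed binary
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'
```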
Critical forensic steps:
Volume and node snapshots: Capture cloud volume snapshots (AWS EBS, Azure Managed Disks, GCP Persistent Disks), node root disk snapshots, and container writable layers before pod termination
Memory dumps: Preserve running processes and network connections
Log collection: Gather all relevant log files
Network analysis: Examine traffic patterns and connection attempts
Common Kubernetes Security Scenarios:
Cryptomining Detection and Response:
Indicators: High CPU usage (>80% sustained), outbound connections to mining pools, suspicious processes (xmrig, minerd)
Immediate containment:
```yaml
# Block mining pool egress via NetworkPolicy (allows only in-cluster traffic)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-mining-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 443
```
Evidence collection: Process list, network connections, container image SHA, deployment source
Root cause remediation: Scan image for vulnerabilities, review RBAC, enforce resource limits, require image signing
Exposed Kubernetes Dashboard:
Indicators: Unauthenticated access, suspicious token creation, unexpected cluster-admin bindings
Immediate containment: Delete dashboard service, revoke tokens, audit RBAC changes
Evidence collection: Dashboard access logs, API audit logs, source IPs from flow logs
Root cause remediation: Redeploy with authentication, restrict to internal network, implement SSO
Compromised Service Account:
Indicators: Service account used from unexpected IPs, unusual API calls, privilege escalation attempts
Immediate containment: Delete token secrets, remove RBAC bindings, cordon nodes where the SA was used (see the sketch after this scenario)
Evidence collection: Audit logs filtered by SA name, pod specifications, network flow logs
Root cause remediation: Implement least-privilege RBAC, enable workload identity, rotate tokens
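A containment sketch for this scenario, assuming legacy token secrets are in use; the namespace, service account, and binding names are placeholders:

```bash
NS=<namespace>
SA=<compromised-service-account>

# Find and delete legacy token secrets bound to the SA
kubectl get secrets -n "$NS" --field-selector type=kubernetes.io/service-account-token
kubectl delete secret <token-secret-name> -n "$NS"

# Locate and remove RBAC bindings granting the SA access
kubectl get rolebindings,clusterrolebindings -A -o wide | grep "$SA"
kubectl delete rolebinding <suspicious-binding> -n "$NS"

# Verify what the SA can still do
kubectl auth can-i --list --as="system:serviceaccount:${NS}:${SA}"
```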
Advanced threat hunting and analysis
Proactive threat hunting means actively searching for signs of compromise that automated detection missed. Regularly analyze Kubernetes audit logs for suspicious API calls, particularly from service accounts behaving unusually.
Look for service accounts suddenly creating resources they don't normally need, like new roles or secrets. Anonymous access attempts signal potential reconnaissance or exploitation. Disable anonymous authentication (--anonymous-auth=false) unless required for health checks, enforce least-privilege RBAC for system:anonymous, and investigate source IPs and timing in audit logs. Unusual authentication patterns—like logins from unexpected locations or odd times—often indicate compromised credentials.
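Assuming audit logs exported as JSON lines (field names follow the Kubernetes audit event schema; the file name is illustrative), jq queries like these surface both patterns:

```bash
# Anonymous access attempts: timestamp, source IP, verb, and URI
jq -r 'select(.user.username == "system:anonymous")
       | [.requestReceivedTimestamp, (.sourceIPs[0] // "n/a"), .verb, .requestURI]
       | @tsv' audit.log

# Service accounts creating roles or secrets they normally would not
jq -r 'select(.user.username | startswith("system:serviceaccount:"))
       | select(.verb == "create" and (.objectRef.resource == "roles" or .objectRef.resource == "secrets"))
       | [.user.username, .objectRef.resource, .objectRef.namespace]
       | @tsv' audit.log
```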
Behavioral analysis helps establish baselines for normal activity and spot attack indicators. Monitor resource usage patterns, network flows, and user access behaviors across your cluster.
Cross-cluster incident coordination
Multiple Kubernetes clusters create unique coordination challenges. Without centralized visibility, security teams waste time switching between tools. Establish unified logging by assigning unique cluster IDs, using consistent labels (environment, team, service), and centralizing logs into a SIEM or SOAR platform. This enables cross-cluster correlation—tracking compromised service accounts across dev and prod clusters.
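One way to implement the cluster ID convention, assuming logs already flow through Fluent Bit: the record_modifier filter stamps every record before it leaves the cluster (the config path and values are placeholders):

```bash
cat <<'EOF' >> /fluent-bit/etc/fluent-bit.conf
[FILTER]
    Name    record_modifier
    Match   kube.*
    Record  cluster_id prod-us-east-1
    Record  environment production
EOF
```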
Consistent security policies across all environments are essential. Development, staging, and production clusters should use the same security controls and response procedures, making automation easier and reducing configuration errors during incidents.
Establish clear escalation paths and ensure all team members can access centralized incident management tools.
Automated response and orchestration
Manual incident response is too slow for cloud-native environments where attacks spread in seconds. Admission controllers act as API server gatekeepers, validating workloads before deployment. Pod Security Admission (PSA) enforces three security profiles (privileged, baseline, restricted) at the namespace level. Third-party controllers like OPA Gatekeeper and Kyverno add custom policy enforcement.
Policy-as-Code tools like OPA and Kyverno integrate with admission controllers to enforce custom security rules. They automatically block containers with excessive privileges or prevent unapproved image deployment. GitOps practices enable automated remediation workflows that revert malicious changes to their last secure state.
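As one illustration of blocking excessive privileges, a Kyverno policy along these lines denies privileged containers cluster-wide; the =() anchors make the check apply only when those fields are present:

```bash
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: enforce
  rules:
    - name: no-privileged-containers
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Privileged containers are not allowed"
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
EOF
```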
Code-to-cloud traceability for root cause remediation: Automated response systems should trace runtime incidents to their origin—the container image, IaC template, Git repository, and CI/CD pipeline that deployed the vulnerable workload. When you detect a container with excessive privileges, the system identifies the Helm chart or Terraform module that created it, the Git commit introducing the misconfiguration, and the owning team. This enables remediation tickets with full context, source template fixes, and prevention of future recurrence.
Key automation capabilities:
Automatic policy enforcement: Block risky deployments before production
Incident escalation: Route alerts to appropriate teams based on severity
Evidence collection: Automatically gather logs and snapshots
Rollback procedures: Quickly revert to known-good configurations (see the sketch below)
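A rollback sketch covering both the kubectl and GitOps paths mentioned above, with illustrative deployment and commit names:

```bash
# Revert a deployment to its previous ReplicaSet
kubectl rollout undo deployment/<deployment-name> -n <namespace>
kubectl rollout history deployment/<deployment-name> -n <namespace>

# With GitOps, revert the offending commit so the controller reconverges
git revert <bad-commit-sha>
git push
```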
Recovery and post-incident activities
Recovery begins after containing the threat and eliminating attacker access. Root cause analysis is essential for understanding how the breach occurred and preventing similar incidents. Examine configuration drift between your intended infrastructure state and what was actually running.
Deployment history analysis identifies when vulnerabilities were introduced and how they went undetected. This information improves security controls and detection capabilities. Post-incident reviews should involve all relevant teams to document and share lessons learned.
The recovery process includes updating security policies, improving detection rules, and strengthening failed controls. Conduct tabletop exercises to test updated procedures and ensure team members understand their roles.
Building a proactive Kubernetes security program
Proactive security prevents incidents rather than just responding to them. Continuous vulnerability scanning and configuration assessments, combined with least-privilege enforcement (dropping unnecessary Linux capabilities), image signing verification (cosign, Notary v2), and SBOM attestation help prevent exploitation—critical since only 21% of organizations disable insecure Linux capabilities. This shift-left approach catches problems early, when they're easier and cheaper to fix.
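A sketch of the signing and scanning steps, using cosign's key-based flow and Trivy as one scanner option; the registry, image, and tag are placeholders:

```bash
# Generate a signing key pair (cosign.key / cosign.pub)
cosign generate-key-pair

# Sign the image and verify the signature
cosign sign --key cosign.key registry.example.com/app:1.4.2
cosign verify --key cosign.pub registry.example.com/app:1.4.2

# Scan the image and fail the pipeline on critical findings
trivy image --severity CRITICAL --exit-code 1 registry.example.com/app:1.4.2
```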
Baseline Security Policies for Incident Prevention:
Pod Security Admission (namespace-level):
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
The "restricted" profile blocks privileged containers, host namespaces, and insecure capabilities—preventing 80% of common container escapes.
Default-deny NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Apply to all namespaces, then explicitly allow required traffic. This limits lateral movement during incidents.
Image signature verification (Kyverno):
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: enforce
  rules:
    - name: verify-signature
      match:
        resources:
          kinds:
            - Pod
      verifyImages:
        - imageReferences:
            - "*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      -----END PUBLIC KEY-----
```
Blocks unsigned images, preventing supply chain attacks.
Required labels for ownership:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: enforce
  rules:
    - name: check-labels
      match:
        resources:
          kinds:
            - Deployment
            - StatefulSet
      validate:
        message: "Deployments must have team and owner labels"
        pattern:
          metadata:
            labels:
              team: "?*"
              owner: "?*"
```
Ensures clear ownership for incident escalation and paging.
Security champions within development teams embed security practices throughout your organization. They advocate for secure coding, help colleagues understand security risks, and provide feedback about practical developer challenges.
Regular security assessments and penetration testing validate your security controls. These exercises identify defense gaps and ensure incident response procedures work under realistic conditions.
Compliance Considerations for Kubernetes IR:
SOC 2 Type II requirements:
CC6.1 (Logical Access): Document RBAC policies and service account usage
CC7.2 (System Monitoring): Implement audit logging and alerting
CC7.3 (Incident Response): Maintain documented IR procedures and evidence retention
C1.1 (Confidentiality): Encrypt sensitive data in etcd and persistent volumes
ISO 27001 Annex A controls:
A.12.4.1 (Event Logging): Enable comprehensive audit logging across control plane and nodes
A.16.1.4 (Incident Assessment): Document incident classification and escalation procedures
A.16.1.5 (Incident Response): Maintain IR playbooks and conduct regular tabletop exercises
A.16.1.7 (Evidence Collection): Preserve forensic evidence per legal and regulatory requirements
PCI DSS (for payment processing workloads):
Requirement 10: Log all access to cardholder data environments
Requirement 10.6: Review logs daily for anomalies
Requirement 12.10: Implement and test incident response plan quarterly
HIPAA (for healthcare workloads):
§164.308(a)(6): Implement security incident procedures
§164.312(b): Maintain audit controls and logs
§164.308(a)(1)(ii)(D): Conduct regular risk assessments
Practical implementation: Map IR procedures to required controls, document evidence collection and retention policies, and conduct annual compliance audits of your Kubernetes security posture.
How Wiz transforms Kubernetes incident response
Wiz Defend provides real-time Kubernetes detection and response with high-fidelity detections curated by Wiz Research, reducing blind spots in dynamic container environments. The platform prioritizes precision over volume—detections correlate multiple signals (process execution, network connections, file access, API calls) to identify genuine threats while filtering out benign anomalies that generate false positives.
The Wiz Security Graph automatically correlates runtime threats with cloud context, showing complete attack paths from compromised containers to critical assets like admin accounts or sensitive data stores. This contextual approach enables faster incident scoping and more accurate risk assessment during active investigations.
Wiz's lightweight eBPF Runtime Sensor captures forensic evidence from ephemeral containers without performance impact. The Investigation Graph visualizes complete attack timelines and blast radius automatically, reducing mean time to investigate from hours to minutes.
The platform's code-to-cloud correlation traces runtime incidents back to vulnerable source code and Infrastructure as Code templates, enabling true root cause remediation. This capability helps development teams fix underlying issues at their source, preventing the same vulnerabilities from being reintroduced in future deployments.
Request a demo to see Kubernetes incident detection, the Investigation Graph for attack path visualization, and runtime forensics with eBPF in action.