Kubernetes downtime doesn't come from where you expect
Before covering solutions, we need to name the actual causes. Across production clusters we manage, 62% of service degradation events originate in misconfigured Deployments (probes, resource limits, rolling update strategy). Only 15% have their root cause in underlying infrastructure failures.
The most common patterns: rolling updates that temporarily leave the service without available replicas, pods entering CrashLoopBackOff during deployments due to aggressive probes, and overcommitted nodes that trigger the OS OOM Killer at the worst possible moment.
Pod Disruption Budgets: your first line of defense
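The most critical mistake we find in production clusters is running node maintenance or autoscaler scale-downs without PodDisruptionBudgets (PDBs). Without a PDB, a node drain can evict every pod of a Deployment that sits on the drained nodes, potentially leaving your service with zero available replicas. PDBs govern voluntary evictions such as drains; the pace of a Deployment's own rolling update is controlled by its update strategy, covered in the next section.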
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2 # always keep at least 2 pods running
  selector:
    matchLabels:
      app: api

The operational rule: every Deployment with more than one replica receiving production traffic must have a PDB. No exceptions. With 3 replicas and no PDB, the Cluster Autoscaler can drain 2 nodes simultaneously, leaving a single replica active, exactly when you don't want that.
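One caveat with an absolute minAvailable: if the Deployment is ever scaled down to exactly that number of replicas, the budget allows zero evictions and node drains will block. A possible alternative, sketched here as an assumption rather than the configuration used above, expresses the budget as maxUnavailable so that it follows the replica count. The name api-pdb-relative is hypothetical; the namespace and labels are taken from the example above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb-relative # hypothetical name, for illustration only
  namespace: production
spec:
  maxUnavailable: 1 # allow at most one voluntary eviction at a time
  selector:
    matchLabels:
      app: api
```

A single PDB can set either minAvailable or maxUnavailable, not both. Running kubectl get pdb -n production shows how many disruptions each budget currently allows, which is a quick way to spot a budget that would block a drain.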
Rolling Update Strategy: maxSurge and maxUnavailable matter
The default rolling update configuration in Kubernetes (maxUnavailable: 25%, maxSurge: 25%) can leave your service running at 75% of its desired capacity during an update. For a 4-replica Deployment under high traffic, losing one replica can be enough to saturate the remaining three.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # never reduce available replicas during updates

With maxUnavailable: 0, Kubernetes first brings up the new pod (maxSurge: 1), waits for it to pass the readiness probe, and only then terminates the old pod. This guarantees you always have at least the desired number of replicas available throughout the update.
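To make the 75% figure concrete, the fragment below spells out how the default percentages resolve to pod counts. Kubernetes rounds maxSurge up and maxUnavailable down; the replica counts are illustrative assumptions.

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # rounds up:   ceil(4 * 0.25)  = 1 extra pod allowed
      maxUnavailable: 25% # rounds down: floor(4 * 0.25) = 1 pod may be unavailable
      # Result: as few as 3 of 4 replicas available (75%) during the rollout.
      # With replicas: 10 the same percentages give ceil(2.5) = 3 surge pods
      # and floor(2.5) = 2 unavailable pods, so 8 of 10 stay available.
```

Because of the rounding, the effective capacity floor depends on the replica count, which is one reason absolute values such as maxSurge: 1 and maxUnavailable: 0 are easier to reason about for small Deployments.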
Readiness and Liveness Probes: the configuration that determines uptime
Misconfigured probes are the silent killer of Kubernetes uptime. The most common scenario: initialDelaySeconds is too low for the application startup time, so the readiness probe passes before the database connection pool is ready. The pod receives traffic and the first real request fails. If the liveness probe is also aggressive, the container is restarted repeatedly and the pod ends up in CrashLoopBackOff.
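One option for slow-starting applications, not part of the configuration described in this article, is a startupProbe. It is sketched here under the assumption that the application needs up to two minutes to start; the period and threshold values are illustrative, while the path and port match the liveness endpoint used in this section.

```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 5
  failureThreshold: 24 # up to 24 * 5s = 120s allowed for startup (assumed budget)
```

While a startup probe is defined and has not yet succeeded, Kubernetes does not run the liveness and readiness probes, so a slow start cannot trigger a liveness restart. Once it succeeds, the regular probes take over with their own timings.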
The /health/ready and /health/live endpoints must be semantically distinct. Ready verifies whether the pod can serve traffic right now (active DB connection, dependencies available). Live only verifies whether the process is alive and not in a deadlock.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 4

Resource Requests and Limits: balancing throttling vs. OOM
Without resource requests, the Kubernetes scheduler has no information to place pods optimally. Without limits, a single pod can consume all available memory on a node and trigger the OS OOM Killer — which doesn't distinguish between your application and Kubernetes system components.
The strategy we use in production: measure for at least two weeks using container_memory_working_set_bytes and container_cpu_usage_seconds_total in Prometheus. Set requests at the p50 of observed usage, and set memory limits at the p99.5 plus 20% headroom.
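The percentiles can be computed directly in Prometheus. The rule group below is a minimal sketch of one way to do it, not necessarily the queries used by the authors: the rule names, the namespace="production" and container="api" selectors, and the hourly interval are assumptions, and Prometheus retention must cover the 14-day window.

```yaml
groups:
  - name: resource-sizing # hypothetical group name
    interval: 1h # 14-day range queries are expensive, so evaluate rarely
    rules:
      # Memory request candidate: median working set over two weeks
      - record: sizing:container_memory_working_set_bytes:p50_14d
        expr: |
          quantile_over_time(0.5,
            container_memory_working_set_bytes{namespace="production", container="api"}[14d])
      # Memory limit candidate: p99.5 plus 20% headroom
      - record: sizing:container_memory_limit_bytes:suggested
        expr: |
          1.2 * quantile_over_time(0.995,
            container_memory_working_set_bytes{namespace="production", container="api"}[14d])
      # CPU request candidate: median of the 5-minute usage rate, in cores
      - record: sizing:container_cpu_cores:p50_14d
        expr: |
          quantile_over_time(0.5,
            rate(container_cpu_usage_seconds_total{namespace="production", container="api"}[5m])[14d:5m])
```

The same expressions can be run once as ad hoc queries instead of recording rules if the sizing exercise is only done occasionally.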
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Important: aggressive CPU limits cause silent throttling that manifests as elevated latency without visible errors. Monitor container_cpu_cfs_throttled_seconds_total to catch this before it affects your SLAs.
Anti-Affinity and Multi-Zone Distribution
Three replicas on the same node look like high availability but are in fact a single point of failure. If that node fails, you lose 100% of capacity. topologySpreadConstraints lets you distribute pods across both nodes and availability zones simultaneously.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api

topologySpreadConstraints is more flexible than podAntiAffinity. maxSkew: 1 means the number of matching pods in any two topology domains can differ by at most 1. whenUnsatisfiable: DoNotSchedule holds the pod in Pending until it can be placed without exceeding that skew.
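The trade-off with DoNotSchedule is that a zone outage or a lack of capacity can leave replacement pods Pending. A softer variant, sketched here as an assumption about what some teams may prefer rather than as the recommended setup, keeps the node-level constraint strict and relaxes only the zone-level one.

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule # still never stack replicas unevenly on nodes
          labelSelector:
            matchLabels:
              app: api
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway # prefer balanced zones, but do not block scheduling
          labelSelector:
            matchLabels:
              app: api
```

With ScheduleAnyway the scheduler still favors placements that reduce skew, but it schedules the pod even when no balanced placement exists, trading strict balance for availability during a partial outage.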
Graceful Shutdown: the clean termination nobody configures
When Kubernetes decides to terminate a pod, it sends SIGTERM. The application has terminationGracePeriodSeconds to shut down cleanly. If SIGTERM isn't handled, in-flight requests get cut abruptly. The less obvious problem: there's a lag between Kubernetes sending SIGTERM and the load balancer actually stopping traffic to the pod.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]

The preStop sleep 10 gives kube-proxy time to update iptables rules and stop routing new traffic to the pod before shutdown begins. The sleep counts against terminationGracePeriodSeconds, so in this example the application has roughly 50 seconds left to drain once SIGTERM arrives. This pattern is critical for services with long-lived requests or persistent WebSocket connections.
Proactive Monitoring: alert before the downtime happens
Alerts that fire on CrashLoopBackOff are too late. To maintain a 99.9% SLO, degradation detection must happen before the outage materializes. The Prometheus/Alertmanager alerts we run in production:
- PodRestartRateHigh: more than 3 restarts in 15 minutes (also catches pods that restart intermittently and never stay in CrashLoopBackOff long enough to be noticed)
- DeploymentReplicasMismatch: available replicas below desired for more than 5 minutes
- ContainerCPUThrottling: throttling above 30% (catches silent throttling before it shows up as measurable latency)
- PodPendingTooLong: pod in Pending state for more than 5 minutes (signals a scheduling or capacity issue)
- NodeNotReady: node unavailable (enables preventive action before the scheduler redistributes pods)
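The alerts above can be expressed as Prometheus alerting rules. The group below is a sketch under several assumptions: it relies on kube-state-metrics and cAdvisor metrics being scraped, the namespace="production" selector and severity labels are illustrative, the 30% throttling threshold is interpreted as the share of throttled CFS periods, and the for: durations not stated in the list are guesses to be tuned.

```yaml
groups:
  - name: availability-early-warning # hypothetical group name
    rules:
      - alert: PodRestartRateHigh
        expr: increase(kube_pod_container_status_restarts_total{namespace="production"}[15m]) > 3
        labels:
          severity: warning
      - alert: DeploymentReplicasMismatch
        expr: |
          kube_deployment_status_replicas_available{namespace="production"}
            < kube_deployment_spec_replicas{namespace="production"}
        for: 5m
        labels:
          severity: critical
      - alert: ContainerCPUThrottling
        # Share of CFS periods in which the container was throttled
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m]))
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total{namespace="production"}[5m]))
          > 0.30
        for: 10m # assumed duration
        labels:
          severity: warning
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{namespace="production", phase="Pending"} == 1
        for: 5m
        labels:
          severity: warning
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 2m # assumed duration
        labels:
          severity: critical
```

Routing, grouping, and notification channels are left to Alertmanager configuration and are not shown here.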