How to Reduce Kubernetes Downtime

Kubernetes downtime doesn't come from where you expect

Before covering solutions, we need to name the actual causes. Across production clusters we manage, 62% of service degradation events originate in misconfigured Deployments (probes, resource limits, rolling update strategy). Only 15% have their root cause in underlying infrastructure failures.

The most common patterns: rolling updates that temporarily leave the service without available replicas, pods entering CrashLoopBackOff during deployments due to aggressive probes, and overcommitted nodes that trigger the OS OOM Killer at the worst possible moment.

Pod Disruption Budgets: your first line of defense

The most critical mistake we find in production clusters is running node maintenance or autoscaler scale-downs without PodDisruptionBudgets (PDBs). Without a PDB, a node drain can evict as many pods as it needs to at once, potentially leaving your service with zero available replicas. A Deployment's own rolling update is a separate matter: it is governed by maxSurge and maxUnavailable, not by the PDB.

yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2   # always keep at least 2 pods running
  selector:
    matchLabels:
      app: api

The operational rule: every Deployment with more than one replica receiving production traffic must have a PDB. No exceptions. With 3 replicas and no PDB, the Cluster Autoscaler can drain 2 nodes simultaneously, leaving a single replica active — exactly when you don't want that.

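PDBs protect against voluntary disruptions that go through the Eviction API (node draining, Cluster Autoscaler scale-downs). They do not limit a Deployment's own rolling update, and they do not protect against involuntary disruptions such as hardware failures. For full resilience, combine PDBs with a safe rolling update strategy, anti-affinity rules, and multi-zone distribution.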
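One caveat with a fixed minAvailable: on a Deployment that runs exactly 2 replicas, minAvailable: 2 allows zero evictions and blocks node drains entirely. A possible alternative is to express the budget as maxUnavailable. This is a minimal sketch that reuses the app: api label from the example above; the resource name is hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb-max-unavailable   # hypothetical name
  namespace: production
spec:
  # A PDB accepts either minAvailable or maxUnavailable, not both.
  maxUnavailable: 1   # at most 1 matching pod may be evicted at a time
  selector:
    matchLabels:
      app: api
```

With 3 replicas this behaves like minAvailable: 2, and it keeps allowing one eviction at a time if the Deployment is later scaled to a different size of 2 or more.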

Rolling Update Strategy: maxSurge and maxUnavailable matter

The default Kubernetes rolling update configuration (maxUnavailable: 25%, maxSurge: 25%) can leave your service running at 75% capacity during an update. For a 4-replica Deployment, 25% works out to one pod, so only 3 replicas may be serving at any given moment. Under high traffic, that can be enough to saturate the remaining replicas.

yaml

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # never reduce available replicas during updates

With maxUnavailable: 0, Kubernetes first brings up the new pod (maxSurge: 1), waits for it to pass the readiness probe, and only then terminates the old pod. This guarantees you always have at least the desired number of replicas available throughout the update.
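To show where the strategy block sits and how the pod counts work out, here is a sketch of a complete Deployment. The replica count of 4 follows the earlier example; the image reference and the container port are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # up to 4 + 1 = 5 pods may exist during the rollout
      maxUnavailable: 0  # available pods never drop below 4
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:2.0.0   # hypothetical image
        ports:
        - containerPort: 8080
        readinessProbe:   # the rollout only advances once this passes
          httpGet:
            path: /health/ready
            port: 8080
```

The trade-off is that the cluster needs spare capacity for the surge pod. If the extra pod cannot be scheduled, the rollout stalls instead of reducing availability.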

Readiness and Liveness Probes: the configuration that determines uptime

Misconfigured probes are the silent killer of Kubernetes uptime. The most common scenario: initialDelaySeconds is too low for the application's startup time and the readiness endpoint doesn't check dependencies, so the probe passes before the database connection pool is ready. The pod receives traffic and the first real requests fail. If the liveness probe is also aggressive, the kubelet restarts the container repeatedly until it lands in CrashLoopBackOff.

The /health/ready and /health/live endpoints must be semantically distinct. Ready verifies whether the pod can serve traffic right now (active DB connection, dependencies available). Live only verifies whether the process is alive and not in a deadlock.

yaml

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 4

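Keep the liveness probe conservative by giving it a generous failureThreshold. With the values above, the container must fail four consecutive checks 20 seconds apart, so roughly 80 seconds of failures, before the kubelet restarts it. A container that is restarted during a momentary latency spike and then goes through a full cold start generates more downtime than one that responds slowly for a few seconds.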
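For applications with long or variable startup times, one option the configuration above does not use is a startupProbe. While a startup probe is defined and has not yet succeeded, the kubelet holds off liveness and readiness checks, so startup time no longer has to be guessed through initialDelaySeconds. The sketch below reuses the /health/live endpoint; the period and threshold are assumed values, not measured ones.

```yaml
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 5
  failureThreshold: 24   # 24 x 5s = up to 120s allowed for startup
```

Once the startup probe succeeds, the readiness and liveness probes take over with their own periods and thresholds.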

Resource Requests and Limits: balancing throttling vs. OOM

Without resource requests, the Kubernetes scheduler has no information to place pods optimally. Without limits, a single pod can consume all available memory on a node and trigger the OS OOM Killer — which doesn't distinguish between your application and Kubernetes system components.

The strategy we use in production: measure for at least two weeks using container_memory_working_set_bytes and container_cpu_usage_seconds_total in Prometheus. Requests = p50. Memory limits = p99.5 with a 20% headroom.

yaml

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Important: aggressive CPU limits cause silent throttling that manifests as elevated latency without visible errors. Monitor container_cpu_cfs_throttled_seconds_total to catch this before it affects your SLAs.
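The percentile-based sizing described above could be computed with Prometheus recording rules along these lines. This is a sketch under assumptions: the namespace and container label values, the group name, and the recorded series names are all hypothetical, and Prometheus needs at least 14 days of retention for the windows to be meaningful.

```yaml
groups:
- name: api-sizing   # hypothetical rule group
  rules:
  # Memory request candidate: median working set over 14 days.
  - record: api:memory_working_set_bytes:p50_14d
    expr: |
      quantile_over_time(0.50,
        container_memory_working_set_bytes{namespace="production", container="api"}[14d])
  # Memory limit candidate: p99.5 plus 20% headroom.
  - record: api:memory_working_set_bytes:p995_14d_headroom
    expr: |
      1.2 * quantile_over_time(0.995,
        container_memory_working_set_bytes{namespace="production", container="api"}[14d])
  # CPU request candidate: median of the 5-minute usage rate, in cores.
  - record: api:cpu_usage_cores:p50_14d
    expr: |
      quantile_over_time(0.50,
        rate(container_cpu_usage_seconds_total{namespace="production", container="api"}[5m])[14d:5m])
```

Each expression returns one value per container series, so wrap it in max() if you want a single number across all pods. Queries over a 14-day range are expensive, and the same expressions can also be run ad hoc instead of being recorded continuously.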

Anti-Affinity and Multi-Zone Distribution

Three replicas on the same node look like high availability but are really a single point of failure. If that node fails, you lose 100% of capacity. topologySpreadConstraints lets you distribute pods across both nodes and availability zones simultaneously.

yaml

spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api

topologySpreadConstraints is more flexible than podAntiAffinity. maxSkew: 1 means the number of matching pods in any two topology domains can differ by at most 1. whenUnsatisfiable: DoNotSchedule holds the pod in Pending until it can be placed without violating that balance.
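For comparison, this is roughly what the node-level rule looks like when written as a hard podAntiAffinity. It is a sketch for illustration, reusing the same app: api label, not a recommended replacement for the constraints above.

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: api
            topologyKey: kubernetes.io/hostname
```

The difference shows up when replicas outnumber nodes. Hard anti-affinity allows at most one api pod per node, so a fourth replica on a three-node cluster stays Pending. The hostname constraint with maxSkew: 1 would let that fourth pod land on a node that already runs one, because a 2-1-1 split still has a skew of 1.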

Graceful Shutdown: the clean termination nobody configures

When Kubernetes decides to terminate a pod, it sends SIGTERM. The application has terminationGracePeriodSeconds to shut down cleanly. If SIGTERM isn't handled, in-flight requests get cut abruptly. The less obvious problem: there's a lag between Kubernetes sending SIGTERM and the load balancer actually stopping traffic to the pod.

yaml

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]

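The preStop sleep 10 gives kube-proxy time to update iptables rules and stop routing new traffic to the pod before shutdown begins. The hook runs before SIGTERM is delivered, and its duration counts against terminationGracePeriodSeconds, so with the values above the application has about 50 seconds left to drain in-flight work. This pattern is critical for services with long-lived requests or persistent WebSocket connections.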

Proactive Monitoring: alert before the downtime happens

Alerts that fire on CrashLoopBackOff are too late. To maintain a 99.9% SLO, degradation detection must happen before the outage materializes. The Prometheus/Alertmanager alerts we run in production:

PodRestartRateHigh: more than 3 restarts in 15 minutes (an early signal of a crash loop in the making)
DeploymentReplicasMismatch: available replicas below desired for more than 5 minutes
ContainerCPUThrottling: more than 30% throttling (catches silent throttling before it affects measurable latency)
PodPendingTooLong: pod in Pending state for more than 5 minutes (signals a scheduling or capacity issue)
NodeNotReady: node unavailable (enables preventive action before the scheduler redistributes pods)
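As a rough sketch, the alerts listed above could be written as a Prometheus rule file like the one below. It assumes kube-state-metrics and cAdvisor metrics are being scraped. The group name, the severity labels, and any for: duration not stated in the list are assumptions to adjust for your environment.

```yaml
groups:
- name: availability-early-warning   # hypothetical group name
  rules:
  - alert: PodRestartRateHigh
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    labels:
      severity: warning
  - alert: DeploymentReplicasMismatch
    expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
    for: 5m
    labels:
      severity: critical
  - alert: ContainerCPUThrottling
    # Share of CFS periods in which the container was throttled.
    expr: |
      sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
        /
      sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))
        > 0.30
    for: 10m   # assumed duration
    labels:
      severity: warning
  - alert: PodPendingTooLong
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 5m
    labels:
      severity: warning
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready", status="true"} == 0
    for: 2m   # assumed duration
    labels:
      severity: critical
```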

Frequently Asked Questions

How many replicas do I need for zero downtime in Kubernetes?

A minimum of 3 replicas distributed across different nodes, combined with a PDB of minAvailable: 2. With 2 replicas, a planned disruption leaves only 1 active — any additional failure causes downtime. With 3 replicas and a correct PDB, you always have a safety margin even with one node in maintenance.

What exactly does a Pod Disruption Budget do?

A PDB limits how many pods of a workload can be simultaneously unavailable during voluntary disruptions that use the Eviction API: node draining and Cluster Autoscaler scale-downs. It does not govern a Deployment's rolling updates (maxSurge and maxUnavailable do that), and it does not protect against involuntary hardware failures. It's the piece that bridges theoretical and operational high availability.

Why do my pods enter CrashLoopBackOff after a rollout?

Most frequent causes in order of probability: (1) liveness probe too aggressive, restarting the container during startup (a failing readiness probe only removes the pod from Service endpoints, it never restarts it), (2) insufficient memory (OOM during cold start), (3) misconfigured environment variables or ConfigMaps, (4) incorrect container image or regression. Always start with kubectl describe pod <name> and kubectl logs <name> --previous to inspect the state before the crash.

When should I use HPA vs. KEDA for autoscaling?

HPA is sufficient for services where load correlates with CPU or memory — proxies, synchronous APIs. KEDA is necessary when the work lives in queues (Kafka, SQS, RabbitMQ) or when load doesn't directly reflect in system metrics. For HTTP APIs under variable load, HPA targeting 60-70% CPU works well. For batch processing workers or event-driven systems, KEDA with queue depth metrics produces significantly better results.
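For the HTTP API case, an HPA targeting the middle of that 60-70% range might look like the sketch below. The resource name and the replica bounds are assumptions; the target Deployment is the api example used throughout.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa   # hypothetical name
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3    # matches the 3-replica minimum recommended above
  maxReplicas: 10   # assumed ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
```

Utilization is measured relative to the containers' CPU requests, so this only works when requests are set on the pods.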

How do I detect CPU throttling in my pods?

Query the container_cpu_cfs_throttled_seconds_total metric in Prometheus, or compare container_cpu_cfs_throttled_periods_total against container_cpu_cfs_periods_total to get a throttling ratio. If that ratio consistently exceeds 25%, your CPU limits are too low for the workload. This throttling raises high-percentile latency (p99) without affecting p50, which makes it hard to detect without the right metrics.
