# Appendix B: Troubleshooting Reference Flows
Reference flows for the failure modes you will see in production.
## B.1 Pod Pending Forever
Common causes (in observed-frequency order):
1. **No node satisfies scheduling constraints.** Events show `FailedScheduling`. Read the reason: insufficient CPU/memory, no matching `nodeSelector`, unmatched taints, no PV available, topology spread blocked.
2. **PVC stuck Pending.** `kubectl get pvc <pvc>`; if it is Pending, check the StorageClass, the provisioner pods, and cloud-side quota.
3. **Image pull failure.** Events show `ErrImagePull`/`ImagePullBackOff`. Check registry auth, that the image tag exists, and network egress to the registry.
4. **Admission webhook rejected the Pod.** Often hidden in apiserver logs; `kubectl get events -A` may surface it.
5. **Quota exceeded.** A `ResourceQuota` denied the creation.
Drilldown: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`.
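As a concrete illustration of cause 1, here is a minimal Pod sketch in which every annotated field is a common way to end up Pending (all names, labels, and numbers are hypothetical):

```yaml
# Hypothetical Pod: each annotated field is a way to end up Pending.
apiVersion: v1
kind: Pod
metadata:
  name: pending-demo
spec:
  nodeSelector:
    disktype: ssd            # no node with this label -> FailedScheduling
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule       # needed only if target nodes carry a matching taint
  containers:
  - name: app
    image: registry.example.com/app:1.0
    resources:
      requests:
        cpu: "2"             # no node with 2 free CPUs -> "insufficient cpu"
        memory: 4Gi
```

When `FailedScheduling` fires, the event message enumerates how many nodes failed each predicate, which tells you which of these fields to relax.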
## B.2 Pod CrashLoopBackOff
Common causes:
1. **App-level crash.** Read the previous container's logs: `kubectl logs <pod> --previous`.
2. **Liveness probe failing.** The probe is killing the container. Check the probe path/port; loosen `initialDelaySeconds`.
3. **OOMKilled.** `kubectl describe` shows `Reason: OOMKilled`. Increase the memory limit or fix the leak.
4. **ConfigMap / Secret missing.** The Pod mounts it; if it is missing, the kubelet fails the start. Watch for events.
5. **Init container failure.** The Pod won't progress; check the init container logs first.
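For causes 2 and 3 the relevant knobs live on the container spec. A hedged sketch, with made-up paths and numbers:

```yaml
# Hypothetical container fragment: a probe that was killing a slow-starting app.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # was 5 -- app needs ~20s to warm up
  periodSeconds: 10
  failureThreshold: 3        # 3 consecutive failures before a restart
resources:
  limits:
    memory: 512Mi            # raise if describe shows Reason: OOMKilled
```

The trade-off: a looser probe delays detection of a genuinely hung container, so widen `initialDelaySeconds` to cover startup, not to paper over crashes.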
## B.3 Node NotReady
Common causes:
1. **kubelet down.** systemd unit failure; check the journal (`journalctl -u kubelet`).
2. **CNI agent down.** The node has no functional networking; the Cilium/Calico DaemonSet pod has crashed.
3. **Disk pressure.** Events show `EvictionThresholdMet`. Free space (delete old container images, journal logs).
4. **PID pressure.** Too many processes on the node.
5. **Kernel-side out-of-resources.** Check `dmesg` on the node.
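The node's Conditions table tells you which of these you have. A self-contained sketch of the check, using a hard-coded sample instead of a live node (the condition names are real; the values are made up):

```shell
# Sample Conditions from a hypothetical NotReady node; on a real cluster
# this table comes from "kubectl describe node <node>".
conditions='MemoryPressure False
DiskPressure True
PIDPressure False
Ready False'

# Any pressure condition that is True explains the NotReady.
printf '%s\n' "$conditions" | awk '$1 != "Ready" && $2 == "True" {print $1}'
# prints "DiskPressure"
```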
## B.4 etcd Degraded
Common causes:
1. **Disk full or slow.** fsync latency spikes; everything else feels slow. Check `etcd_disk_wal_fsync_duration_seconds`.
2. **Leader election thrashing.** Network instability between etcd nodes; check inter-node latency.
3. **Database size growth.** Compaction was forgotten. `etcdctl compact <rev>`; then `etcdctl defrag`.
4. **Quorum lost.** A majority of members is down. Restore from a snapshot to a new cluster and recover.
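etcd's tuning guidance is that the 99th percentile of `etcd_disk_wal_fsync_duration_seconds` should stay under roughly 10ms. A trivial offline check against a hypothetical sampled value:

```shell
# Hypothetical p99 WAL fsync latency in seconds, sampled from Prometheus.
p99_fsync=0.025

# Compare against etcd's ~10ms rule of thumb.
awk -v v="$p99_fsync" 'BEGIN { print ((v > 0.01) ? "disk too slow" : "disk ok") }'
# prints "disk too slow"
```

In practice you would chart the histogram quantile rather than a single sample; a sustained p99 above the bound is what correlates with the "everything feels slow" symptom.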
## B.5 Apiserver 5xx / Timeouts
Common causes:
1. **etcd issues** (see B.4 above).
2. **Webhook timeouts.** Slow admission webhooks block every apply. Check webhook latency; consider `failurePolicy: Ignore`, with caution.
3. **Aggregated API down** (e.g., metrics-server). `kubectl top` fails; downstream features (HPA) degrade.
4. **Apiserver overload.** Too many list/watch consumers; CPU pegged. Add replicas; review API Priority and Fairness flow control.
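For cause 2, the relevant fields sit on the webhook configuration itself. A hypothetical fragment (names and namespaces are made up):

```yaml
# Hypothetical ValidatingWebhookConfiguration fragment. A low timeout bounds
# how long a slow webhook can stall the apiserver; failurePolicy decides
# what happens when it times out anyway.
webhooks:
- name: check.example.com
  timeoutSeconds: 2          # default is 10s -- a long stall on every apply
  failurePolicy: Ignore      # fail open; use Fail only if the check is critical
  clientConfig:
    service:
      name: webhook-svc
      namespace: infra
      path: /validate
```

The caution from the list above applies: `Ignore` means a broken webhook silently stops enforcing whatever policy it implemented.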
## B.6 Service Has No Endpoints
Common causes:
1. **Selector mismatch.** The Service `spec.selector` doesn't match the Pod labels. The most common cause.
2. **Pods not ready.** The readiness probe is failing; only Ready Pods join the Endpoints.
3. **Port mismatch.** Service port name and container port name out of sync.
4. **Topology-aware routing dropping endpoints.** Check the `service.kubernetes.io/topology-aware-hints` annotation.
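Causes 1 and 3 in one hedged sketch: the Service selector must match the Pod's labels exactly, and a named `targetPort` is resolved against the container port *names* (all names below are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                 # must equal the Pod labels, not the Deployment name
  ports:
  - port: 80
    targetPort: http         # resolved against a container port *name* below
---
apiVersion: v1
kind: Pod
metadata:
  name: web-0
  labels:
    app: web                 # matches the selector above
spec:
  containers:
  - name: app
    image: registry.example.com/web:1.0
    ports:
    - name: http             # what "targetPort: http" resolves to
      containerPort: 8080
```

If either pairing drifts (a renamed label, a renamed port), the Service quietly ends up with zero endpoints.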
## B.7 Namespace Stuck Terminating
Cause: A finalizer can't be removed because its owning controller is gone (or stuck).
Fix path (carefully; you are bypassing a safety mechanism):

```shell
kubectl get namespace <ns> -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -
```
But also: investigate why the finalizer wouldn't clear. Often a dangling external resource the operator was waiting on.
## B.8 ImagePullBackOff in a Private Registry
- `kubectl get secret <name> -o yaml`: exists and well-formed?
- Pod's `spec.imagePullSecrets` references it?
- Secret type is `kubernetes.io/dockerconfigjson`?
- Decoded `.dockerconfigjson` has the right registry URL and credentials?
- From the node, can you `crictl pull` the image manually with the same creds?
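The decoding steps can be rehearsed offline. Everything below is a made-up secret; in a real cluster you would start from `kubectl get secret <name> -o jsonpath='{.data.\.dockerconfigjson}'` instead:

```shell
# Hypothetical .dockerconfigjson payload, base64-encoded the way it is
# stored in the Secret's data field.
docker_config='{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}'
b64=$(printf '%s' "$docker_config" | base64 | tr -d '\n')

# Step 1: decode the secret and check the registry URL matches the image's.
printf '%s' "$b64" | base64 -d

# Step 2: the "auth" field is base64("user:password") -- check the creds.
printf '%s' 'dXNlcjpwYXNz' | base64 -d    # prints "user:pass"
```

The registry hostname inside `auths` must match the hostname in the image reference exactly; a mismatch there is a classic cause of auth failures that look like "image not found".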
## B.9 HPA Not Scaling
- `kubectl describe hpa <name>`: the events show why.
- Metrics available? `kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"` for resource metrics; `kubectl get --raw "/apis/custom.metrics.k8s.io/..."` for custom metrics.
- Pod requests set? HPA uses `requests` as the denominator; without them, percentage-based metrics are meaningless.
- `behavior` policies preventing fast scaling? Check `scaleUp.policies` and `stabilizationWindowSeconds`.
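The scaling decision itself is simple arithmetic: desired = ceil(currentReplicas * currentMetric / target), where percentage utilization is usage divided by the Pod's `requests` (hence the bullet above). A worked offline example with made-up numbers:

```shell
# 4 replicas averaging 90% CPU utilization against a 60% target.
current_replicas=4
current_util=90
target_util=60

# Integer ceil(a*b/c) via (a*b + c - 1) / c.
desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
echo "$desired"    # prints 6
```

With no `requests` set, `current_util` cannot be computed at all, which is why the HPA reports the metric as unavailable rather than scaling wrongly.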
## B.10 Mesh: 503 from Sidecar
(Istio specifics, but general patterns apply)
1. Does the Service have Endpoints?
2. mTLS mode `STRICT`, but the caller has no sidecar? Check `PeerAuthentication`.
3. An `AuthorizationPolicy` denying the call?
4. A `DestinationRule` circuit-breaker tripped? `kubectl describe destinationrule <name>`.
5. Envoy access log: `istioctl proxy-config log <pod> --level debug` and re-issue the request.
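For check 2, a hypothetical `PeerAuthentication` that relaxes mTLS for one workload while you bisect the 503 (`PERMISSIVE` accepts both plaintext and mTLS; revert to `STRICT` once the sidecar-less caller is fixed):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: web-permissive
  namespace: prod            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: web               # scope the relaxation to one workload only
  mtls:
    mode: PERMISSIVE         # STRICT rejects plaintext from sidecar-less callers
```

Scoping by selector keeps the rest of the namespace on whatever mesh-wide policy already applies, so the diagnostic change stays narrow.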