
Appendix B: Troubleshooting Reference Flows

Reference flows for the failure modes you will see in production.


B.1 Pod Pending Forever

kubectl describe pod <pod>

Common causes (in observed-frequency order):

  1. **No node satisfies scheduling constraints**. `Events:` shows `FailedScheduling`. Read the reason: insufficient CPU/memory, no matching nodeSelector, taints unmatched, no PV available, topology spread blocked.
  2. **PVC stuck pending**. `kubectl get pvc <pvc>` - if Pending, check StorageClass, provisioner pods, cloud-side quota.
  3. **Image pull failure**. `Events:` shows `ErrImagePull`/`ImagePullBackOff`. Check registry auth, image tag exists, network egress to registry.
  4. **Admission webhook rejected**. Often hidden in apiserver logs; `kubectl get events -A` may surface it.
  5. **Quota exceeded**. `ResourceQuota` denied creation.

Drilldown: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`.
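
A minimal drill-down sketch for the causes above; `<pod>` and `<pvc>` are placeholders:

# Scheduler events for this Pod: the message names the violated constraint.
kubectl get events --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp

# Compare the Pod's requests, nodeSelector, and tolerations against what nodes offer.
kubectl get pod <pod> -o yaml | grep -E -A4 'nodeSelector:|tolerations:|requests:'
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# If storage is the blocker: did the claim bind, and does the StorageClass exist?
kubectl get pvc <pvc> -o wide
kubectl get storageclass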


B.2 Pod CrashLoopBackOff

kubectl logs <pod> --previous
kubectl describe pod <pod>

Common causes:

  1. **App-level crash**. Read the previous container's logs.
  2. **Liveness probe failing**. The probe is killing the container. Check probe path/port; loosen `initialDelaySeconds`.
  3. **OOMKilled**. `kubectl describe` shows `Reason: OOMKilled`. Increase the memory limit or fix the leak.
  4. **ConfigMap / Secret missing**. The Pod is mounting it; if missing, the kubelet fails the start. Watch for events.
  5. **Init container failure**. The Pod won't progress; check init container logs first.
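
A quick triage sketch for the causes above; `<pod>` and `<init>` are placeholders:

# Was the last exit an OOM kill, an app error, or a probe-driven kill?
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'

# Logs from the crashed instance; the current one may not have failed yet.
kubectl logs <pod> --previous --all-containers

# Init containers run first and block everything else.
kubectl logs <pod> -c <init>

# If the liveness probe is the killer, read its settings before loosening them.
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].livenessProbe}{"\n"}'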


B.3 Node NotReady

kubectl describe node <node>
ssh <node> sudo journalctl -u kubelet -f

Common causes:

  1. **kubelet down**. systemd unit failure; check the journal.
  2. **CNI agent down**. The node has no functional networking; the Cilium/Calico DaemonSet pod has crashed.
  3. **Disk pressure**. `Events:` shows `EvictionThresholdMet`. Free space (delete old container images, journal logs).
  4. **PID pressure**. Too many processes.
  5. **Out-of-resources kernel-side**. Check `dmesg` on the node.
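
A node-side sketch, assuming SSH access and a CNI agent running in kube-system; `<node>` is a placeholder:

# Which condition flipped? The message usually names the culprit.
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'

# On the node: kubelet alive? disk full? kernel complaining?
ssh <node> 'sudo systemctl status kubelet --no-pager; df -h /var/lib/kubelet; sudo dmesg -T | tail -20'

# Is the CNI agent Pod on this node running?
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node>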


B.4 etcd Degraded

etcdctl endpoint status --cluster
etcdctl endpoint health --cluster

Common causes:

  1. **Disk full or slow**. fsync latency spikes; everything else feels slow. Check `etcd_disk_wal_fsync_duration_seconds`.
  2. **Leader election thrashing**. Network instability between etcd nodes; check inter-node latency.
  3. **Database size growth**. Forgot to compact. `etcdctl compaction <rev>`; `etcdctl defrag`.
  4. **Quorum lost**. A majority of nodes is down. Restore from a snapshot to a new cluster; recover.
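
A compaction/defrag sketch, assuming `etcdctl` v3 with endpoints and certificates already configured in the environment; run it member by member during a quiet window:

# Compact up to the current revision, then defragment (repeat per member).
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compaction "$REV"
etcdctl defrag

# Disk health: sustained p99 fsync latency above ~10ms points at the disk, not etcd.
# PromQL: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))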


B.5 Apiserver 5xx / Timeouts

kubectl get --raw=/livez
kubectl get --raw='/readyz?verbose'

Common causes:

  1. **etcd issues** (see above).
  2. **Webhook timeouts**. Slow admission webhooks block every apply. Check webhook latency; consider `failurePolicy: Ignore` with caution.
  3. **Aggregated API down** (e.g., metrics-server). `kubectl top` fails; downstream features (HPA) degrade.
  4. **Apiserver overload**. Too many list/watch consumers; CPU pegged. Add replicas; review priority-and-fairness flow control.
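
A quick sketch for isolating which dependency is at fault, assuming access to the raw health endpoints:

# Which checks are failing? etcd and webhook-related checks show up by name.
kubectl get --raw='/readyz?verbose' | grep -v ok

# Slow or broken admission webhooks block writes cluster-wide.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns=NAME:.metadata.name,TIMEOUT:.webhooks[*].timeoutSeconds,POLICY:.webhooks[*].failurePolicy

# Aggregated APIs that are down show Available=False here.
kubectl get apiservices | grep -v True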


B.6 Service Has No Endpoints

kubectl get endpoints <service>
kubectl get endpointslices -l kubernetes.io/service-name=<service>

Common causes:

  1. **Selector mismatch**. The Service `spec.selector` doesn't match the Pod labels. Most common.
  2. **Pods not ready**. Readiness probe failing; only ready Pods join the Endpoints.
  3. **Port mismatch**. Service port name vs container port name out of sync.
  4. **Topology-aware routing dropping endpoints**. Check `service.kubernetes.io/topology-aware-hints`.
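
A selector/readiness sketch; `<service>` and `<key>=<value>` (the Service's selector) are placeholders:

# What does the Service select, and do any Pods actually carry those labels?
kubectl get service <service> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pods -l <key>=<value> --show-labels

# Only ready Pods are placed into EndpointSlices; check the READY column.
kubectl get pods -l <key>=<value> -o wide

# Do the Service targetPort and the container port (name or number) line up?
kubectl get service <service> -o jsonpath='{.spec.ports}{"\n"}'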


B.7 Namespace Stuck Terminating

kubectl get namespace <ns> -o json | jq .spec.finalizers

Cause: A finalizer can't be removed because its owning controller is gone (or stuck).

Fix path (carefully: you are bypassing a safety mechanism):

kubectl get namespace <ns> -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -

But also: investigate why the finalizer wouldn't clear. Often a dangling external resource the operator was waiting on.
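
To find what the finalizer was waiting on, a common enumeration sketch is to list every namespaced resource type still present in `<ns>`, custom resources included:

# Anything printed here is what is keeping the namespace alive.
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -I{} kubectl get {} -n <ns> --ignore-not-found --show-kind --no-headers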


B.8 ImagePullBackOff in a Private Registry

  1. `kubectl get secret <secret> -o yaml` - does it exist and is it well-formed?
  2. Pod's `spec.imagePullSecrets` references it?
  3. Secret type is `kubernetes.io/dockerconfigjson`?
  4. Decoded `.dockerconfigjson` has the right registry URL and credentials? (Decode sketch after this list.)
  5. From the node, can you `crictl pull` the image manually with the same creds?
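
A decode-and-recreate sketch, assuming the pull secret is named `<secret>`; the registry, user, and password values are placeholders:

# Inspect the decoded auth config: is the registry host present, are the creds current?
kubectl get secret <secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .

# Recreate it if the credentials are wrong, then reference it from spec.imagePullSecrets.
kubectl create secret docker-registry <secret> \
  --docker-server=<registry> --docker-username=<user> --docker-password=<password> \
  --dry-run=client -o yaml | kubectl apply -f -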

B.9 HPA Not Scaling

  1. `kubectl describe hpa <hpa>` - the events show why.
  2. Metrics available? `kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"` for resource metrics; `kubectl get --raw "/apis/custom.metrics.k8s.io/..."` for custom.
  3. Pod requests set? HPA uses requests as the denominator; without them, percentage-based metrics are meaningless.
  4. `behavior` policies preventing fast scaling? Check `scaleUp.policies` and `stabilizationWindowSeconds`. (Inspection sketch after this list.)
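
A short inspection sketch for points 3 and 4; `<deployment>`, `<hpa>`, and `<ns>` are placeholders:

# Requests on the scaled Pods: HPA percentage targets are computed against these.
kubectl get deployment <deployment> -o jsonpath='{.spec.template.spec.containers[*].resources.requests}{"\n"}'

# What the metrics API actually reports for the Pods right now.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<ns>/pods" | jq '.items[] | {pod: .metadata.name, usage: .containers[].usage}'

# Scaling behavior and stabilization windows (autoscaling/v2); empty output means defaults.
kubectl get hpa <hpa> -o yaml | grep -A12 'behavior:'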

B.10 Mesh: 503 from Sidecar

(Istio specifics, but general patterns apply)

  1. Service has Endpoints?
  2. mTLS mode strict, but caller without a sidecar? Check `PeerAuthentication`.
  3. `AuthorizationPolicy` denying the call?
  4. `DestinationRule` with a circuit breaker tripped? `kubectl describe destinationrule`.
  5. Envoy access log: `istioctl proxy-config log <pod> --level debug` and re-issue the request.
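
An Istio-flavored sketch; `<pod>`, `<ns>`, and `<service>` are placeholders, and the sidecar container is assumed to be named `istio-proxy`:

# Is the sidecar in sync with the control plane, and does Envoy see endpoints for the target?
istioctl proxy-status
istioctl proxy-config endpoints <pod> | grep <service>

# Surface common misconfigurations (mTLS mismatch, missing DestinationRule, etc.).
istioctl analyze -n <ns>

# Envoy response flags in the access log (UH, UF, NR, UO, ...) say why it returned 503.
kubectl logs <pod> -c istio-proxy --tail=50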
