A Kubernetes 1.31 upgrade should feel routine, like swapping a tire with the car still idling. In practice, it’s closer to changing parts while the engine runs, because controllers, admission webhooks, and CI jobs keep calling the API the whole time.
This runbook focuses on what usually breaks: removed in-tree integrations, Pod Security Admission surprises, webhook and CRD conversion failures, Helm and kustomize drift, RBAC “forbidden” errors, and CI failures triggered by version skew or changed API discovery.
Keep it boring on purpose: timebox prep, rehearse in staging, then upgrade with tight observability and a rollback plan.
T-2 weeks: inventory what will break (before it breaks you)
- Pick the exact target patch (v1.31.x) and freeze it for staging and prod.
- Read the upstream removal list, then map each item to a cluster dependency. Start with Kubernetes removals and major changes in v1.31, then confirm details in the Kubernetes 1.31 changelog.
- Scan for removed in-tree plugins (a common foot-gun in older manifests):
  - CephFS and RBD in-tree volume plugins are removed in 1.31. Find usage with `kubectl get pv -o yaml | grep -E 'cephfs|rbd'`.
  - In-tree cloud provider code is removed. If you still depend on cloud-specific in-tree behavior, plan external cloud controllers and CSI drivers first.
- Run an API usage audit against your cluster telemetry and pipelines. If you don’t already do this, follow a structured approach like pre-upgrade Kubernetes API checks. At minimum, capture:
  - `kubectl api-resources`
  - `kubectl get --raw /metrics | grep apiserver_request_total` (or your metrics backend equivalent)
  - Any controller logs that show repeated “deprecated” or “no matches for kind” warnings
- Find kubelet flags that disappeared and purge them from node configs before the upgrade. On nodes, inspect service drop-ins and live args (a scripted version of this scan is sketched just after this list):
  - `ps aux | grep kubelet`
  - `journalctl -u kubelet -b | tail -n 200`
  - Watch for removed flags such as `--keep-terminated-pod-volumes`, `--iptables-masquerade-bit`, and `--iptables-drop-bit`.
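A minimal sketch of that inventory pass, assuming `jq` is installed locally and that the node hostnames passed as arguments are reachable over SSH (both are assumptions of the sketch, not requirements of the upgrade itself):

```bash
#!/usr/bin/env bash
# Pre-upgrade inventory sketch: flag PVs that still use removed in-tree
# CephFS/RBD plugins, and kubelet processes that still carry removed flags.
# Assumes: kubectl points at the target cluster, jq is installed, and the
# node names passed as arguments are SSH-reachable.
set -euo pipefail

echo "== PVs using removed in-tree cephfs/rbd plugins =="
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.cephfs or .spec.rbd) | .metadata.name'

echo "== Kubelet processes still carrying removed flags =="
REMOVED='--keep-terminated-pod-volumes|--iptables-masquerade-bit|--iptables-drop-bit'
for node in "$@"; do
  if ssh "$node" "ps -o args= -C kubelet" | grep -Eq "$REMOVED"; then
    echo ">> $node still sets removed kubelet flags"
  else
    echo "$node: clean"
  fi
done
```

Run it against a handful of representative nodes first; the goal is to find stale flags before a rolling node replacement does.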
Gotcha: the upgrade rarely fails at the control plane first. It fails when new nodes come up with old kubelet flags, or when a webhook blocks every create request.
T-1 week: rehearse the upgrade and harden admission paths
- Rehearse in a staging cluster that matches prod (admission chain, CRDs, PSP/PSA posture, ingress, storage classes). If staging is “lighter,” it won’t catch your real blockers.
- Prove Pod Security Admission (PSA) behavior before prod:
  - List namespace labels: `kubectl get ns --show-labels | grep pod-security`
  - If you plan to tighten policy, start with `pod-security.kubernetes.io/warn` and `pod-security.kubernetes.io/audit` first.
  - Remember: upgrades re-create pods. A namespace that quietly ran “privileged-ish” pods can start rejecting replacements under `enforce=baseline` or `enforce=restricted`.
- Audit admission webhooks for upgrade fragility:
  - `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations`
  - Check `failurePolicy`, `timeoutSeconds`, and whether the webhook Service and endpoints exist.
  - Confirm the webhook CA bundle and cert rotation path work during API server restarts.
- Audit CRDs with conversion webhooks:
  - `kubectl get crd -o jsonpath='{range .items[?(@.spec.conversion.webhook)]}{.metadata.name}{"\n"}{end}'`
  - For each, confirm the conversion webhook Deployment is highly available and tolerates control plane disruptions.
- Pre-flight manifests against the new API server (catches schema drift and removed fields); a combined pre-flight sketch follows this list:
  - `helm template ... | kubectl apply --dry-run=server -f -`
  - `kustomize build ... | kubectl apply --dry-run=server -f -`
- Write rollback criteria now (error budget, blocked deploys, node NotReady threshold), so you don’t debate it mid-incident.
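A combined pre-flight sketch for the admission-related checks above. The `baseline` level, the chart path `charts/app`, and the overlay path `overlays/staging` are placeholders; the PSA dry-run assumes you eventually want at least `baseline` enforcement:

```bash
#!/usr/bin/env bash
# Staging pre-flight sketch: preview PSA impact, list fail-closed webhooks,
# and server-side dry-run rendered manifests against the staging API server.
set -euo pipefail

# 1) Dry-run an enforce label on every namespace. The API server prints
#    warnings for existing pods that would be rejected, without changing labels.
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=baseline

# 2) Show which webhooks would block writes if their backend goes down.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,WEBHOOKS:.webhooks[*].name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT:.webhooks[*].timeoutSeconds'

# 3) Server-side dry-run of rendered manifests (paths are placeholders).
helm template charts/app | kubectl apply --dry-run=server -f -
kustomize build overlays/staging | kubectl apply --dry-run=server -f -
```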
Upgrade day: sequencing, watchpoints, and the exact commands to run
Use the upstream flow as your baseline, then apply your platform’s mechanics. The core sequence matches Upgrade a Cluster: control plane first, then nodes, then add-ons.
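For a kubeadm-managed control plane (an assumption; managed platforms substitute their own mechanics), the flow looks roughly like this, with `1.31.x` standing in for the exact patch you froze earlier:

```bash
# On the first control plane node: upgrade kubeadm itself, review the plan,
# then apply. Package commands assume a Debian/Ubuntu node with held packages.
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm='1.31.x-*'
sudo apt-mark hold kubeadm

sudo kubeadm upgrade plan            # lists component versions and blockers
sudo kubeadm upgrade apply v1.31.x

# On each additional control plane node:
sudo kubeadm upgrade node
```

Kubelet and kubectl packages follow per node during the drain/uncordon loop below.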
- Start a live “red board” (one channel, one doc): timestamps, versions, and every anomaly.
- Freeze deploys (or restrict to break-glass) so you don’t mix app changes with platform change.
- Before touching anything, record state:
  - `kubectl version`
  - `kubectl get nodes -o wide`
  - `kubectl get pods -A -o wide`
  - `kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50`
- Upgrade control plane, then immediately check API health:
  - Watch API error rate and latency in your metrics system. Useful signals include `apiserver_request_total` (5xx) and webhook metrics such as `apiserver_admission_webhook_rejection_count`.
  - If you run audit logs, filter for sudden spikes in denials and missing resources (for example, repeated `403` with “forbidden,” or `404` on removed endpoints).
- Upgrade nodes in small batches (a batch loop is sketched after this list):
  - `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m`
  - After reboot or replacement, confirm with `kubectl get node <node>`, then `kubectl uncordon <node>`.
  - If a node won’t rejoin, go straight to `journalctl -u kubelet -b` and look for removed flags or CNI errors.
- Validate add-ons and controllers (vendor-neutral, but always the same idea):
  - DNS, CNI, kube-proxy (or replacement), ingress, storage CSI, metrics stack
  - If any are managed by Helm, run `helm upgrade --dry-run` first to catch schema changes.
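A sketch of the node batch loop, assuming node names are supplied in `NODE_BATCH` and that the actual kubelet upgrade or node replacement happens out-of-band between the drain and the readiness check:

```bash
#!/usr/bin/env bash
# Drain -> upgrade/replace -> verify -> uncordon, one node at a time.
set -euo pipefail
NODE_BATCH=("worker-1" "worker-2")   # placeholder node names

for node in "${NODE_BATCH[@]}"; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  # ...platform-specific kubelet upgrade, reboot, or node replacement here...
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m
  kubectl uncordon "$node"
  # Pause so you can watch API error rate and workload health before continuing.
  read -r -p "Node $node done. Press Enter for the next node... "
done
```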
Post-upgrade: prove it, then remove temporary safety rails
- Prove basic scheduling and networking (a combined canary and RBAC sketch follows this list):
  - Create a canary Pod and a canary Service, then use `kubectl logs` and `kubectl exec` to confirm DNS and egress.
- Run a conformance-style smoke test suited to your org:
  - If you use Sonobuoy, run the quick suite your team trusts.
  - If you don’t, at least validate critical APIs with `kubectl get` across core resources and key CRDs.
- Re-check RBAC with real identities:
  - For the CI service accounts, run `kubectl auth can-i --list` (scoped to the namespaces they touch).
  - Watch for “forbidden” errors tied to resources that moved, disappeared, or now require different verbs.
- Fix CI client-server skew:
  - Update pinned `kubectl`, `client-go`, and any generators that parse `kubectl api-resources`.
  - Re-run pipeline steps that do discovery, because API discovery output can change even when workloads look fine.
- Tighten PSA only after stability:
  - If you staged `warn` or `audit`, move to `enforce` in a controlled window, one namespace group at a time.
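A post-upgrade smoke sketch covering the canary and RBAC items above. The namespace `upgrade-canary`, the busybox image tag, and the `ci:deployer` ServiceAccount are placeholders:

```bash
#!/usr/bin/env bash
# Canary pod for DNS/egress plus an RBAC spot check as the CI identity.
set -euo pipefail

kubectl create namespace upgrade-canary --dry-run=client -o yaml | kubectl apply -f -
kubectl -n upgrade-canary run canary --image=busybox:1.36 --restart=Never \
  --command -- sleep 3600
kubectl -n upgrade-canary wait --for=condition=Ready pod/canary --timeout=2m

# DNS and egress from inside the canary.
kubectl -n upgrade-canary exec canary -- nslookup kubernetes.default.svc.cluster.local
kubectl -n upgrade-canary exec canary -- wget -qO- -T 5 http://example.com >/dev/null \
  && echo "egress ok"

# What can the CI ServiceAccount actually do after the upgrade?
kubectl auth can-i --list --as=system:serviceaccount:ci:deployer -n upgrade-canary

kubectl delete namespace upgrade-canary
```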
Fast triage: the failures you’ll see first (and where to look)
Use this table during the upgrade window to cut diagnosis time.
| Symptom | Likely cause | First commands to run |
|---|---|---|
| New deploys fail with “forbidden” | PSA enforce labels or RBAC drift | `kubectl get ns --show-labels`, `kubectl auth can-i <verb> <resource> -n <ns>` |
| Everything fails to create, even ConfigMaps | Broken validating/mutating webhook | `kubectl get events -A`, `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations` |
| CRD objects fail with conversion errors | Conversion webhook down or incompatible | `kubectl get crd <name> -o yaml` (check `.spec.conversion`), then check the conversion webhook Deployment |
| Nodes stuck NotReady after upgrade | Removed kubelet flags or CNI mismatch | `journalctl -u kubelet -b`, `kubectl -n kube-system get pods -o wide` |
| CI fails on discovery or apply | Client version skew, `api-resources` output changed | `kubectl version`, `kubectl api-resources`, rerun `--dry-run=server` |
If you can’t create or update any object, suspect admission first. A single webhook can stop the whole cluster from changing state.
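One quick way to confirm that, sketched below: a server-side dry-run of a throwaway ConfigMap (the name is a placeholder) exercises the full admission chain without persisting anything, and the error message usually names the offending webhook.

```bash
# Does the admission chain let a trivial write through? A server-side dry-run
# runs admission but persists nothing; the error usually names the webhook.
kubectl create configmap upgrade-probe --from-literal=ok=true \
  -n default --dry-run=server -o name

# Recent Warning events often show webhook timeouts and denials directly.
kubectl get events -A --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp | tail -n 20
```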
Conclusion
A Kubernetes 1.31 upgrade goes well when you treat it like a dependency migration, not a button click. Inventory removals, rehearse PSA behavior, and pressure-test admission and CRD conversions in staging. On upgrade day, watch API errors and webhook health as closely as node readiness. Afterward, fix CI skew and only then tighten security labels. The cleanest upgrades are the ones where nothing “exciting” happens.

