A Kubernetes 1.31 upgrade should feel routine, like swapping a tire with the car still idling. In practice, it’s closer to changing parts while the engine runs, because controllers, admission webhooks, and CI jobs keep calling the API the whole time.
This runbook focuses on what usually breaks: removed in-tree integrations, Pod Security Admission surprises, webhook and CRD conversion failures, Helm and kustomize drift, RBAC “forbidden” errors, and CI failures triggered by version skew or changed API discovery.
Keep it boring on purpose: timebox prep, rehearse in staging, then upgrade with tight observability and a rollback plan.
T-2 weeks: inventory what will break (before it breaks you)
- Pick the exact target patch (v1.31.x) and freeze it for staging and prod.
- Read the upstream removal list, then map each item to a cluster dependency. Start with Kubernetes removals and major changes in v1.31, then confirm details in the Kubernetes 1.31 changelog.
- Scan for removed in-tree plugins (a common foot-gun in older manifests):
  - CephFS and RBD in-tree volume plugins are removed in 1.31. Find usage with `kubectl get pv -o yaml | grep -E 'cephfs|rbd'`.
  - In-tree cloud provider code is removed. If you still depend on cloud-specific in-tree behavior, plan external cloud controllers and CSI drivers first.
- Run an API usage audit against your cluster telemetry and pipelines. If you don’t already do this, follow a structured approach like pre-upgrade Kubernetes API checks. At minimum, capture:
  - `kubectl api-resources`
  - `kubectl get --raw /metrics | grep apiserver_request_total` (or your metrics backend equivalent)
  - Any controller logs that show repeated “deprecated” or “no matches for kind” warnings
- Find kubelet flags that disappeared and purge them from node configs before the upgrade. On nodes, inspect service drop-ins and live args (a scripted version of this scan is sketched just after this list):
  - `ps aux | grep kubelet`
  - `journalctl -u kubelet -b | tail -n 200`
  - Watch for removed flags such as `--keep-terminated-pod-volumes`, `--iptables-masquerade-bit`, and `--iptables-drop-bit`.
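A minimal sketch of that inventory pass, assuming `jq` is installed locally and that the node hostnames passed as arguments are reachable over SSH (both are assumptions of the sketch, not requirements of the upgrade itself):

```bash
#!/usr/bin/env bash
# Pre-upgrade inventory sketch: flag PVs that still use removed in-tree
# CephFS/RBD plugins, and kubelet processes that still carry removed flags.
# Assumes: kubectl points at the target cluster, jq is installed, and the
# node names passed as arguments are SSH-reachable.
set -euo pipefail

echo "== PVs using removed in-tree cephfs/rbd plugins =="
kubectl get pv -o json \
  | jq -r '.items[] | select(.spec.cephfs or .spec.rbd) | .metadata.name'

echo "== Kubelet processes still carrying removed flags =="
REMOVED='--keep-terminated-pod-volumes|--iptables-masquerade-bit|--iptables-drop-bit'
for node in "$@"; do
  if ssh "$node" "ps -o args= -C kubelet" | grep -Eq "$REMOVED"; then
    echo ">> $node still sets removed kubelet flags"
  else
    echo "$node: clean"
  fi
done
```

Run it against a handful of representative nodes first; the goal is to find stale flags before a rolling node replacement does.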
Gotcha: the upgrade rarely fails at the control plane first. It fails when new nodes come up with old kubelet flags, or when a webhook blocks every create request.
T-1 week: rehearse the upgrade and harden admission paths
- Rehearse in a staging cluster that matches prod (admission chain, CRDs, PSP/PSA posture, ingress, storage classes). If staging is “lighter,” it won’t catch your real blockers.
- Prove Pod Security Admission (PSA) behavior before prod:
  - List namespace labels: `kubectl get ns --show-labels | grep pod-security`
  - If you plan to tighten policy, start with `pod-security.kubernetes.io/warn` and `pod-security.kubernetes.io/audit` first.
  - Remember: upgrades re-create pods. A namespace that quietly ran “privileged-ish” pods can start rejecting replacements under `enforce=baseline` or `enforce=restricted`.
- Audit admission webhooks for upgrade fragility:
  - `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations`
  - Check `failurePolicy`, `timeoutSeconds`, and whether the webhook Service and endpoints exist.
  - Confirm the webhook CA bundle and cert rotation path work during API server restarts.
- Audit CRDs with conversion webhooks:
  - `kubectl get crd -o jsonpath='{range .items[?(@.spec.conversion.webhook)]}{.metadata.name}{"\n"}{end}'`
  - For each, confirm the conversion webhook Deployment is highly available and tolerates control plane disruptions.
- Pre-flight manifests against the new API server (catches schema drift and removed fields); a combined pre-flight sketch follows this list:
  - `helm template ... | kubectl apply --dry-run=server -f -`
  - `kustomize build ... | kubectl apply --dry-run=server -f -`
- Write rollback criteria now (error budget, blocked deploys, node NotReady threshold), so you don’t debate it mid-incident.
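A combined pre-flight sketch for the admission-related checks above. The `baseline` level, the chart path `charts/app`, and the overlay path `overlays/staging` are placeholders; the PSA dry-run assumes you eventually want at least `baseline` enforcement:

```bash
#!/usr/bin/env bash
# Staging pre-flight sketch: preview PSA impact, list fail-closed webhooks,
# and server-side dry-run rendered manifests against the staging API server.
set -euo pipefail

# 1) Dry-run an enforce label on every namespace. The API server prints
#    warnings for existing pods that would be rejected, without changing labels.
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=baseline

# 2) Show which webhooks would block writes if their backend goes down.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,WEBHOOKS:.webhooks[*].name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT:.webhooks[*].timeoutSeconds'

# 3) Server-side dry-run of rendered manifests (paths are placeholders).
helm template charts/app | kubectl apply --dry-run=server -f -
kustomize build overlays/staging | kubectl apply --dry-run=server -f -
```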
Upgrade day: sequencing, watchpoints, and the exact commands to run
Use the upstream flow as your baseline, then apply your platform’s mechanics. The core sequence matches Upgrade a Cluster: control plane first, then nodes, then add-ons.
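For a kubeadm-managed control plane (an assumption; managed platforms substitute their own mechanics), the flow looks roughly like this, with `1.31.x` standing in for the exact patch you froze earlier:

```bash
# On the first control plane node: upgrade kubeadm itself, review the plan,
# then apply. Package commands assume a Debian/Ubuntu node with held packages.
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm='1.31.x-*'
sudo apt-mark hold kubeadm

sudo kubeadm upgrade plan            # lists component versions and blockers
sudo kubeadm upgrade apply v1.31.x

# On each additional control plane node:
sudo kubeadm upgrade node
```

Kubelet and kubectl packages follow per node during the drain/uncordon loop below.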
- Start a live “red board” (one channel, one doc): timestamps, versions, and every anomaly.
- Freeze deploys (or restrict to break-glass) so you don’t mix app changes with platform change.
- Before touching anything, record state:
  - `kubectl version`
  - `kubectl get nodes -o wide`
  - `kubectl get pods -A -o wide`
  - `kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50`
- Upgrade control plane, then immediately check API health:
  - Watch API error rate and latency in your metrics system. Useful signals include `apiserver_request_total` (5xx) and webhook metrics such as `apiserver_admission_webhook_rejection_count`.
  - If you run audit logs, filter for sudden spikes in denials and missing resources (for example, repeated `403` with “forbidden,” or `404` on removed endpoints).
- Upgrade nodes in small batches (a batch loop is sketched after this list):
  - `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m`
  - After reboot or replacement, confirm with `kubectl get node <node>`, then `kubectl uncordon <node>`.
  - If a node won’t rejoin, go straight to `journalctl -u kubelet -b` and look for removed flags or CNI errors.
- Validate add-ons and controllers (vendor-neutral, but always the same idea):
  - DNS, CNI, kube-proxy (or replacement), ingress, storage CSI, metrics stack
  - If any are managed by Helm, run `helm upgrade --dry-run` first to catch schema changes.
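A sketch of the node batch loop, assuming node names are supplied in `NODE_BATCH` and that the actual kubelet upgrade or node replacement happens out-of-band between the drain and the readiness check:

```bash
#!/usr/bin/env bash
# Drain -> upgrade/replace -> verify -> uncordon, one node at a time.
set -euo pipefail
NODE_BATCH=("worker-1" "worker-2")   # placeholder node names

for node in "${NODE_BATCH[@]}"; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  # ...platform-specific kubelet upgrade, reboot, or node replacement here...
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m
  kubectl uncordon "$node"
  # Pause so you can watch API error rate and workload health before continuing.
  read -r -p "Node $node done. Press Enter for the next node... "
done
```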
Post-upgrade: prove it, then remove temporary safety rails
- Prove basic scheduling and networking (a combined canary and RBAC sketch follows this list):
  - Create a canary Pod and a canary Service, then use `kubectl logs` and `kubectl exec` to confirm DNS and egress.
- Run a conformance-style smoke test suited to your org:
  - If you use Sonobuoy, run the quick suite your team trusts.
  - If you don’t, at least validate critical APIs with `kubectl get` across core resources and key CRDs.
- Re-check RBAC with real identities:
  - For the CI service accounts, run `kubectl auth can-i --list` (scoped to the namespaces they touch).
  - Watch for “forbidden” errors tied to resources that moved, disappeared, or now require different verbs.
- Fix CI client-server skew:
  - Update pinned `kubectl`, `client-go`, and any generators that parse `kubectl api-resources`.
  - Re-run pipeline steps that do discovery, because API discovery output can change even when workloads look fine.
- Tighten PSA only after stability:
  - If you staged `warn` or `audit`, move to `enforce` in a controlled window, one namespace group at a time.
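A post-upgrade smoke sketch covering the canary and RBAC items above. The namespace `upgrade-canary`, the busybox image tag, and the `ci:deployer` ServiceAccount are placeholders:

```bash
#!/usr/bin/env bash
# Canary pod for DNS/egress plus an RBAC spot check as the CI identity.
set -euo pipefail

kubectl create namespace upgrade-canary --dry-run=client -o yaml | kubectl apply -f -
kubectl -n upgrade-canary run canary --image=busybox:1.36 --restart=Never \
  --command -- sleep 3600
kubectl -n upgrade-canary wait --for=condition=Ready pod/canary --timeout=2m

# DNS and egress from inside the canary.
kubectl -n upgrade-canary exec canary -- nslookup kubernetes.default.svc.cluster.local
kubectl -n upgrade-canary exec canary -- wget -qO- -T 5 http://example.com >/dev/null \
  && echo "egress ok"

# What can the CI ServiceAccount actually do after the upgrade?
kubectl auth can-i --list --as=system:serviceaccount:ci:deployer -n upgrade-canary

kubectl delete namespace upgrade-canary
```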
Fast triage: the failures you’ll see first (and where to look)
Use this table during the upgrade window to cut diagnosis time.
| Symptom | Likely cause | First commands to run |
|---|---|---|
| New deploys fail with “forbidden” | PSA enforce labels or RBAC drift | `kubectl get ns --show-labels`, `kubectl auth can-i <verb> <resource> -n <ns>` |
| Everything fails to create, even ConfigMaps | Broken validating/mutating webhook | `kubectl get events -A`, `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations` |
| CRD objects fail with conversion errors | Conversion webhook down or incompatible | `kubectl get crd <name> -o yaml` (check `.spec.conversion`), then check the conversion webhook Deployment |
| Nodes stuck NotReady after upgrade | Removed kubelet flags or CNI mismatch | `journalctl -u kubelet -b`, `kubectl -n kube-system get pods -o wide` |
| CI fails on discovery or apply | Client version skew, `api-resources` output changed | `kubectl version`, `kubectl api-resources`, rerun `--dry-run=server` |
If you can’t create or update any object, suspect admission first. A single webhook can stop the whole cluster from changing state.
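One quick way to confirm that, sketched below: a server-side dry-run of a throwaway ConfigMap (the name is a placeholder) exercises the full admission chain without persisting anything, and the error message usually names the offending webhook.

```bash
# Does the admission chain let a trivial write through? A server-side dry-run
# runs admission but persists nothing; the error usually names the webhook.
kubectl create configmap upgrade-probe --from-literal=ok=true \
  -n default --dry-run=server -o name

# Recent Warning events often show webhook timeouts and denials directly.
kubectl get events -A --field-selector type=Warning \
  --sort-by=.metadata.creationTimestamp | tail -n 20
```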
Conclusion
A Kubernetes 1.31 upgrade goes well when you treat it like a dependency migration, not a button click. Inventory removals, rehearse PSA behavior, and pressure-test admission and CRD conversions in staging. On upgrade day, watch API errors and webhook health as closely as node readiness. Afterward, fix CI skew and only then tighten security labels. The cleanest upgrades are the ones where nothing “exciting” happens.

