Upgrade Kubernetes nodes with Anyscale services running

This article explains how to perform Kubernetes node upgrades and cluster version upgrades without disrupting Anyscale services backed by long-lived Ray clusters.

Standard Kubernetes node drain procedures block on the Ray head pod because Anyscale deploys a PodDisruptionBudget (PDB) that prevents the head from being evicted. Use the service rollout method described here to vacate old nodes safely.

Symptoms

When you cordon or drain a node running an Anyscale service, you may see the following:

  • kubectl drain hangs or reports that eviction is blocked by a PDB.
  • Pods remain on the old or cordoned node indefinitely.
  • kubectl get PodDisruptionBudget -n <anyscale-ns> shows Allowed disruptions: 0:
NAME                      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
anyscale-ray-head-nodes   N/A             0                 0                     156d
  • The PDB description shows a warning about unmanaged pods:
Warning  UnmanagedPods  Pods selected by this PodDisruptionBudget were found to be
unmanaged. As a result, the status of the PDB cannot be calculated correctly, which
may result in undefined behavior.
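
If you want to detect this condition in a script, a minimal sketch is the following helper, which parses the default `kubectl get pdb` column layout shown above (the PDB name is from the output above; the column positions assume kubectl's default table format):

```shell
# Print the ALLOWED DISRUPTIONS column for the head-node PDB from
# `kubectl get pdb` output. With the default table layout, the value
# is the 4th whitespace-separated field on the data row.
allowed_disruptions() {
  awk 'NR > 1 && $1 == "anyscale-ray-head-nodes" { print $4 }'
}

# Against a live cluster (namespace is a placeholder):
#   kubectl get pdb -n <anyscale-ns> | allowed_disruptions
```

If this prints 0, any eviction of a matching pod will be blocked.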

Cause

Anyscale's Kubernetes operator deploys a PDB named anyscale-ray-head-nodes that targets Ray head pods and sets maxUnavailable: 0. This tells Kubernetes that zero head pods can be unavailable at any time, so any eviction attempt is blocked, including on cordoned nodes.

The UnmanagedPods warning appears because Ray head pods are managed by the Anyscale operator rather than a standard Kubernetes Deployment. Kubernetes can't determine the replica count from the usual sources, so the PDB's disruption budget calculation returns zero allowed disruptions regardless of actual cluster state.
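
The shape of that PDB can be sketched as a manifest. This is an illustration, not the exact object Anyscale deploys: the name and maxUnavailable value match the output above, but the selector labels are assumptions, so inspect the PDB in your cluster for the real ones:

```shell
# Write out a sketch of the blocking PDB. The name and maxUnavailable
# value match the symptoms above; the selector labels are assumptions
# for illustration only.
cat > /tmp/anyscale-head-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: anyscale-ray-head-nodes
spec:
  maxUnavailable: 0        # zero head pods may ever be unavailable
  selector:
    matchLabels:
      ray-node-type: head  # assumed label; check your cluster
EOF

# maxUnavailable: 0 makes every eviction request fail, which is why
# `kubectl drain` blocks on the head pod.
grep 'maxUnavailable' /tmp/anyscale-head-pdb.yaml
```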

Solution

Both manual node replacements and managed cluster upgrades require the same approach: trigger an Anyscale service rollout to move Ray pods off target nodes before Kubernetes attempts to drain them. The PDB blocks automated drains just as it blocks manual ones, so pre-positioning pods is the key step in either case.

Manual node replacement

Use this procedure when you're replacing specific nodes yourself, for example, when swapping instance types or terminating individual nodes.

To perform the upgrade, do the following:

  1. Cordon the nodes you want to vacate so no new pods schedule on them:

    kubectl cordon <node-name>

    Repeat for each node you're replacing.

  2. Trigger a new service version in the Anyscale console or CLI. Changing an environment variable or editing a comment in the service config counts as a change and is enough to start a rollout.

  3. Anyscale creates new replicas, including a new Ray head, on the remaining uncordoned nodes and gradually shifts traffic to them.

  4. Once the new version is healthy and traffic has shifted, Anyscale terminates the old replicas. The old nodes are now free of Anyscale pods.

  5. Verify no Anyscale pods remain on the old nodes:

    kubectl get pods -n <anyscale-ns> -o wide | grep <node-name>

  6. Drain and terminate the old nodes:

    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
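
The drain gate in steps 5 and 6 can be sketched as a small helper. The node name and namespace are placeholders, and the parse assumes the default `kubectl get pods -o wide` layout, where NODE is the 7th column:

```shell
# Count pods still scheduled on a node by parsing `kubectl get pods -o wide`
# output (NODE is the 7th column in the default layout).
pods_on_node() {
  # $1 = node name; stdin = pod listing
  awk -v node="$1" 'NR > 1 && $7 == node { n++ } END { print n + 0 }'
}

# Against a live cluster (placeholders as in the steps above):
#   remaining=$(kubectl get pods -n <anyscale-ns> -o wide | pods_on_node <node-name>)
#   if [ "$remaining" -eq 0 ]; then
#     kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
#   fi
```

Gating the drain on a zero count avoids racing the rollout: the drain only runs once Anyscale has already terminated the old replicas.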

Managed cluster upgrades

Use this procedure when your cloud provider or upgrade tooling drains nodes automatically, for example, EKS managed node group updates, GKE node pool upgrades, or AKS node pool upgrades. Because these tools initiate drains on their own schedule, you must pre-position Anyscale pods before starting the upgrade.

To prepare for a managed upgrade, do the following:

  1. Identify all Anyscale services running on the cluster.

  2. Trigger a new version rollout for each service. Bumping an environment variable or editing a comment in the service config is enough.

  3. Wait for all rollouts to complete. Confirm each service shows the new version healthy in the Anyscale console before continuing.

  4. Verify no Ray head pods remain on the nodes scheduled for replacement:

    kubectl get pods -n <anyscale-ns> -o wide

  5. Initiate the managed cluster upgrade. The nodes targeted for replacement should now be free of Anyscale pods, so the automated drain proceeds without PDB conflicts.
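
Step 4 can be automated with a similar parse. Matching head pods by the substring "head" in the pod name is an assumption for illustration; match on your cluster's pod labels if they differ:

```shell
# Count Ray head pods scheduled on any of the target nodes, parsing
# `kubectl get pods -o wide` output (NAME is column 1, NODE is column 7).
# Matching /head/ in the pod name is an assumption.
heads_on_targets() {
  # $1 = space-separated target node names; stdin = pod listing
  awk -v targets="$1" '
    BEGIN { split(targets, t, " "); for (i in t) on[t[i]] = 1 }
    NR > 1 && $1 ~ /head/ && on[$7] { n++ }
    END { print n + 0 }'
}

# Against a live cluster: start the managed upgrade only when this prints 0.
#   kubectl get pods -n <anyscale-ns> -o wide | heads_on_targets "node-a node-b"
```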