Troubleshoot head node eviction

Kubernetes can evict a head node pod when the scheduler preempts it to make room for a higher-priority pod. Losing the head node disrupts the cluster.

Symptoms

Cluster logs show messages indicating head node eviction:

Instance k-ecb7f6f7c200c0000 (node IP: <node-ip>) will be terminated soon
(reason: pod eviction/deletion; for more details, check Kubernetes events for
pod/k-ecb7f6f7c200c0000 or node/<node-name>).

Or preemption messages:

Terminating instance k-d403ea0c7bf910000 due to an unexpected failure: Pod was disrupted
(reason: PreemptionByScheduler, details: default-scheduler: preempting to accommodate a higher priority pod).
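To find affected instances in a saved log, a minimal sketch follows. It assumes the log text matches the sample messages above; the function names `find_evicted_instances` and `show_pod_events` are hypothetical helpers, not part of any Anyscale tooling.

```shell
# Print the unique instance IDs named in eviction or preemption log lines.
# Reads log text on stdin. The patterns are copied from the sample messages
# above (assumption: your logs use the same wording).
find_evicted_instances() {
  grep -E 'will be terminated soon|Pod was disrupted' \
    | grep -oE '[Ii]nstance k-[0-9a-f]+' \
    | awk '{print $2}' \
    | sort -u
}

# List Kubernetes events for one pod, oldest first.
# $1: namespace, $2: pod name (the instance ID from the log message)
show_pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}

# Usage:
#   find_evicted_instances < cluster.log
#   show_pod_events <namespace> k-ecb7f6f7c200c0000
```

The second helper follows the hint in the log message itself, which points to the Kubernetes events for the pod named after the instance.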

Root cause

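The Kubernetes scheduler is preempting the head node pod to make room for a higher-priority pod. A PodDisruptionBudget (PDB) that covers the head node helps prevent this, although the scheduler honors PDBs during preemption only on a best-effort basis. See PodDisruptionBudget in the Kubernetes docs.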

The workloads.enableAnyscaleRayHeadNodePDB Helm parameter defaults to true, so this issue typically appears only after someone disables PDB protection. See High availability.
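To confirm whether PDB protection was disabled, one option is to read the value back from the installed release. This is a sketch: `head_pdb_flag` is a hypothetical helper, and it assumes the key `enableAnyscaleRayHeadNodePDB` appears exactly once in the release's computed values.

```shell
# Print the current value of enableAnyscaleRayHeadNodePDB for a release.
# $1: release name, $2: namespace
# --all includes chart defaults, so the key shows up even if never overridden.
head_pdb_flag() {
  helm get values "$1" --namespace "$2" --all \
    | awk -F': *' '/^[[:space:]]*enableAnyscaleRayHeadNodePDB:/ {print $2}'
}

# Usage: head_pdb_flag <release-name> <namespace>
```

A result of `false` suggests that someone disabled the PDB and that the steps in the Solution section apply.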

Solution

Re-enable the PodDisruptionBudget for Anyscale head nodes through the Helm chart configuration.

Configure PDB through Helm

Set the workloads.enableAnyscaleRayHeadNodePDB flag when installing or upgrading the Anyscale operator:

helm upgrade <release-name> anyscale/anyscale-operator \
  --set workloads.enableAnyscaleRayHeadNodePDB=true \
  --namespace <namespace>
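If the release was installed with other custom values, note that `helm upgrade` resets values to the chart defaults whenever new values are passed, unless `--reuse-values` is set. The variant below is a sketch that keeps the existing overrides; `enable_head_pdb` is a hypothetical wrapper, and whether `--reuse-values` suits your release is an assumption to check, because it can also carry forward stale values across chart version changes.

```shell
# Re-enable the head node PDB while keeping the release's existing values.
# $1: release name, $2: namespace
enable_head_pdb() {
  helm upgrade "$1" anyscale/anyscale-operator \
    --namespace "$2" \
    --reuse-values \
    --set workloads.enableAnyscaleRayHeadNodePDB=true
}

# Usage: enable_head_pdb <release-name> <namespace>
```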

Verify PDB configuration

Check that the PodDisruptionBudget was created:

kubectl get poddisruptionbudgets -n <namespace> -o yaml

Example output showing a properly configured PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: anyscale-ray-head-nodes
  namespace: <namespace>
  labels:
    app.kubernetes.io/instance: <release-name>
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: anyscale-operator
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      ray-node-type: head
  unhealthyPodEvictionPolicy: AlwaysAllow
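For a scripted check, the sketch below reads only the field that matters most. It assumes the PDB is named `anyscale-ray-head-nodes`, as in the example output above; `check_head_pdb` is a hypothetical helper.

```shell
# Verify that the head node PDB exists and allows no voluntary disruptions.
# $1: namespace
check_head_pdb() {
  max_unavailable=$(kubectl get poddisruptionbudget anyscale-ray-head-nodes \
    -n "$1" -o jsonpath='{.spec.maxUnavailable}') || return 1
  if [ "$max_unavailable" = "0" ]; then
    echo "OK: maxUnavailable is 0"
  else
    echo "WARN: maxUnavailable is '${max_unavailable}'"
    return 1
  fi
}

# Usage: check_head_pdb <namespace>
```

The function returns a non-zero status when the PDB is missing or permits disruptions, so it can gate a deployment script.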

Prevention

To prevent head node eviction, do the following:

  1. Keep workloads.enableAnyscaleRayHeadNodePDB enabled in your Helm values, or re-enable it as shown above.
  2. Schedule head nodes on on-demand node pools rather than spot instances.
  3. Configure pod priorities so that other workloads sharing the same nodes don't outrank Anyscale head node pods.
  4. Monitor cluster events for eviction and preemption warnings.
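Items 2 through 4 can be spot-checked from the command line. The following is a sketch: it assumes head pods carry the `ray-node-type=head` label used by the PDB selector above, and the helper names are hypothetical.

```shell
# Show each head pod's priority, priority class, and the node it runs on.
# Compare NODE against your node pools to see whether it is on-demand or spot.
# $1: namespace
head_pod_placement() {
  kubectl get pods -n "$1" -l ray-node-type=head \
    -o custom-columns='NAME:.metadata.name,PRIORITY:.spec.priority,PRIORITY_CLASS:.spec.priorityClassName,NODE:.spec.nodeName'
}

# Stream namespace events and keep only lines that mention preemption or
# eviction. The text match is deliberately broad (assumption: event wording
# varies by Kubernetes version).
# $1: namespace
watch_disruption_events() {
  kubectl get events -n "$1" --watch | grep -Ei 'preempt|evict'
}

# Usage:
#   head_pod_placement <namespace>
#   watch_disruption_events <namespace>
```

Running `kubectl get priorityclasses` alongside the first helper shows which classes in the cluster rank above the head pods.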

Related resources