Troubleshoot head node eviction
Head nodes may be evicted due to pod preemption by higher-priority pods, causing cluster disruption.
Symptoms
Cluster logs show messages indicating head node eviction:
Instance k-ecb7f6f7c200c0000 (node IP: <node-ip>) will be terminated soon
(reason: pod eviction/deletion; for more details, check Kubernetes events for
pod/k-ecb7f6f7c200c0000 or node/<node-name>).
Or preemption messages:
Terminating instance k-d403ea0c7bf910000 due to an unexpected failure: Pod was disrupted
(reason: PreemptionByScheduler, details: default-scheduler: preempting to accommodate a higher priority pod).
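To find which instances were affected, you can scan saved cluster logs for the two messages above. The following is a minimal sketch, not an Anyscale tool: the function name find_evicted_instances is hypothetical, and it assumes each message starts on a single log line in the format shown.

```shell
# Print the unique instance IDs named in eviction or preemption log lines.
# Reads log text from stdin. The patterns cover only the two messages shown
# above; other eviction messages are not detected.
find_evicted_instances() {
  grep -E 'Instance k-[0-9a-f]+ .*will be terminated soon|Terminating instance k-[0-9a-f]+ due to' \
    | grep -owE 'k-[0-9a-f]+' \
    | sort -u
}
```

For example, find_evicted_instances < cluster.log prints one instance ID per line. As the log message suggests, you can then look up the matching Kubernetes events, for instance with kubectl get events -n <namespace> --field-selector involvedObject.name=<instance-id>.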
Root cause
The head node is being preempted by Kubernetes to accommodate higher-priority pods. A PodDisruptionBudget (PDB) on the head node can help prevent this. See PodDisruptionBudget in the Kubernetes docs.
The workloads.enableAnyscaleRayHeadNodePDB Helm parameter defaults to true, so this issue typically appears only after someone disables PDB protection. See High availability.
Solution
Re-enable PodDisruptionBudget for Anyscale head nodes through the Helm chart configuration.
Configure PDB through Helm
Set the workloads.enableAnyscaleRayHeadNodePDB flag when installing or upgrading the Anyscale operator:
helm upgrade <release-name> anyscale/anyscale-operator \
  --set workloads.enableAnyscaleRayHeadNodePDB=true \
  --namespace <namespace>
Verify PDB configuration
Check that the PodDisruptionBudget was created:
kubectl get poddisruptionbudgets -n <namespace> -o yaml
Example output showing a properly configured PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: anyscale-ray-head-nodes
  namespace: <namespace>
  labels:
    app.kubernetes.io/instance: <release-name>
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: anyscale-operator
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      ray-node-type: head
  unhealthyPodEvictionPolicy: AlwaysAllow
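If you prefer a scripted check over reading the YAML, the sketch below reads only the maxUnavailable field. The function name check_head_pdb is hypothetical, and the sketch assumes the PDB is named anyscale-ray-head-nodes as in the example output; adjust the name if your installation differs.

```shell
# Report whether the head node PDB exists and has maxUnavailable set to 0.
# Usage: check_head_pdb <namespace>
check_head_pdb() {
  namespace="$1"
  # Assumption: the PDB name matches the example output above.
  max_unavailable=$(kubectl get poddisruptionbudget anyscale-ray-head-nodes \
    -n "$namespace" -o jsonpath='{.spec.maxUnavailable}' 2>/dev/null) || {
    echo "PDB not found in namespace $namespace"
    return 1
  }
  if [ "$max_unavailable" = "0" ]; then
    echo "PDB present, maxUnavailable=0"
  else
    echo "PDB present, but maxUnavailable=$max_unavailable"
    return 1
  fi
}
```

A non-zero return status means the PDB is missing or allows disruptions, which makes the function usable in a post-upgrade script.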
Prevention
To prevent head node eviction, do the following:
- Keep workloads.enableAnyscaleRayHeadNodePDB enabled in your Helm values, or re-enable it as shown above.
- Schedule head nodes on on-demand node pools rather than spot instances.
- Configure appropriate pod priorities for Anyscale workloads.
- Monitor cluster events for eviction warnings.
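To review pod priorities and node placement for head pods, a listing like the following may help. It is a sketch under one assumption: that head pods carry the ray-node-type=head label used by the PDB selector in the example output. The function name list_head_pod_priorities is hypothetical.

```shell
# List head pods with their priority class, numeric priority, and node.
# Usage: list_head_pod_priorities <namespace>
list_head_pod_priorities() {
  namespace="$1"
  # Assumption: head pods are labeled ray-node-type=head.
  kubectl get pods -n "$namespace" -l ray-node-type=head \
    -o custom-columns='NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority,NODE:.spec.nodeName'
}
```

Head pods with a lower priority than other workloads in the cluster are candidates for scheduler preemption, and the NODE column shows which node pool each head pod landed on.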
Related resources
- Kubernetes PodDisruptionBudget documentation
- For high-availability Helm parameters, see High availability.
- For guidance on temporarily disabling PDB to allow node upgrades, see Upgrade Kubernetes nodes with Anyscale services running.