Troubleshoot GPU pods stuck in Pending

When a GPU worker pod stays in Pending and kubectl describe pod <pod> shows a FailedScheduling event that reports an untolerated taint, the pod's tolerations don't match the taints on your AKS GPU node pool. Anyscale on Azure applies several taints to GPU and capacity-type-specific node pools to keep general workloads off them. Your Ray cluster pod specs must tolerate those taints.

Taints applied by Anyscale on Azure node pools

Anyscale on Azure quickstart node pools apply the following taints. Your pod specs must tolerate any taints present on the node pool you want them to schedule on.

Taint key                                Where it appears
nvidia.com/gpu                           Any node pool backed by a GPU-capable VM SKU.
node.anyscale.com/capacity-type          Node pools that Anyscale manages by capacity type, such as on-demand or spot.
kubernetes.azure.com/scalesetpriority    Spot-priority node pools managed by AKS.

List the taints on every node in the cluster, then find the nodes that belong to the node pool you care about:

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
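
To make the matching rule concrete, the following is a minimal Python sketch of how Kubernetes decides whether a toleration covers a taint. It is an illustration, not Anyscale or Kubernetes code. The function names are hypothetical, and the taint values in the sample data (for example, present) are assumptions; substitute the values that the command above reports for your nodes.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty toleration effect matches every taint effect.
    tol_effect = toleration.get("effect")
    if tol_effect and tol_effect != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and matches every taint key.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    # Equal: values must match; a missing value is treated as an empty string.
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints, tolerations):
    """Return the scheduling-blocking taints that no toleration matches."""
    # PreferNoSchedule is only a soft preference, so it never blocks scheduling.
    blocking = [t for t in (taints or []) if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking if not any(tolerates(tol, t) for tol in tolerations)]


# Illustrative data only: taints shaped like a node's .spec.taints (values assumed).
node_taints = [
    {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"},
    {"key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "effect": "NoSchedule"},
]
pod_tolerations = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

missing = untolerated_taints(node_taints, pod_tolerations)
print([t["key"] for t in missing])
```

In this sample the pod tolerates only the GPU taint, so the spot-priority taint is reported as untolerated. A pod with that gap would stay in Pending on a spot GPU node pool.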

Add tolerations to your Anyscale compute config

Add a tolerations block to your Anyscale compute config under advanced_instance_config. Set at the cluster level, the block applies to all pods in the cluster. Set on a worker group, it applies only to that group's pods. Worker-group tolerations fully replace any cluster-level tolerations for that group, so repeat every toleration the group still needs.

For example, to tolerate the three taints above on a GPU worker group, add:

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: node.anyscale.com/capacity-type
    operator: Exists
    effect: NoSchedule
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule

For the full compute-config syntax including how to apply settings at the cluster level versus the worker-group level, see Compute configuration options for Kubernetes.

Confirm the NVIDIA device plugin is running

Even with the right tolerations, GPU pods stay Pending if the NVIDIA device plugin daemonset isn't running and Ready on your GPU nodes. The device plugin exposes nvidia.com/gpu as a schedulable resource. Without it, the scheduler can't satisfy a GPU resource request.

Check the device plugin daemonset:

kubectl get daemonset -n kube-system -l name=nvidia-device-plugin-ds

Confirm DESIRED and READY match the number of GPU nodes in your cluster. If the daemonset is missing or pods aren't ready, install or reinstall it per Use GPUs on AKS.
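
If you prefer to script this check, the sketch below compares the same counts from the JSON form of the daemonset, for example the output of the command above with -o json appended. It is a rough illustration under stated assumptions. The function name device_plugin_status and the sample payload are hypothetical, and you supply the GPU node count yourself. The status fields desiredNumberScheduled and numberReady are the ones behind the DESIRED and READY columns.

```python
import json


def device_plugin_status(daemonset: dict, gpu_node_count: int) -> str:
    """Summarize whether a device plugin daemonset covers every GPU node."""
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    if desired != gpu_node_count:
        return f"scheduled on {desired} nodes, expected {gpu_node_count}"
    if ready != desired:
        return f"only {ready} of {desired} pods are Ready"
    return "ok"


# Illustrative payload only: kubectl returns a list object when you use a label selector.
sample = json.loads(
    '{"items": [{"metadata": {"name": "nvidia-device-plugin-ds"},'
    ' "status": {"desiredNumberScheduled": 2, "numberReady": 1}}]}'
)

for ds in sample["items"]:
    print(ds["metadata"]["name"], device_plugin_status(ds, gpu_node_count=2))
```

An empty items list means no daemonset matched the label selector, which corresponds to the missing-daemonset case described above.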