Troubleshoot GPU pods stuck in Pending

When a GPU worker pod stays in `Pending` and `kubectl describe pod <pod>` reports an untolerated taint in its events, the pod's tolerations don't match the taints on your AKS GPU node pool. Anyscale on Azure applies several taints to GPU and capacity-type-specific node pools to keep general workloads off them, so your Ray cluster pod specs must tolerate those taints.
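To read the scheduler's reason without paging through the full `describe` output, you can filter the pod's events down to `FailedScheduling`. The sketch below wraps that in a shell function. The function name `pending_reason` and the `default` namespace fallback are illustrative choices, not part of Anyscale's tooling.

```shell
# Hypothetical helper: print FailedScheduling events for one pod.
# Usage: pending_reason <pod-name> [namespace]
pending_reason() {
  pod="${1:?usage: pending_reason <pod-name> [namespace]}"
  ns="${2:-default}"   # assumption: fall back to the default namespace
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}
```

The event message names the taint the pod failed to tolerate, which tells you which toleration is missing.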

Taints applied by Anyscale on Azure node pools

Anyscale on Azure quickstart node pools apply the following taints. Your pod specs must tolerate any taints present on the node pool you want them to schedule on.

Taint key	Where it appears
`nvidia.com/gpu`	Any node pool backed by a GPU-capable VM SKU.
`node.anyscale.com/capacity-type`	Node pools that Anyscale manages by capacity type, such as on-demand or spot.
`kubernetes.azure.com/scalesetpriority`	Spot-priority node pools managed by AKS.

Inspect the taints on every node in your cluster with:

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
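To narrow the output to a single node pool, you can select nodes by label. The sketch below assumes your AKS nodes carry the `kubernetes.azure.com/agentpool` label with the node pool name as its value. Confirm this with `kubectl get nodes --show-labels` before relying on it. The function name `pool_taints` is hypothetical.

```shell
# Hypothetical helper: list taint keys for the nodes in one AKS node pool.
# Assumption: nodes are labeled kubernetes.azure.com/agentpool=<node-pool-name>.
# Usage: pool_taints <node-pool-name>
pool_taints() {
  pool="${1:?usage: pool_taints <node-pool-name>}"
  kubectl get nodes -l "kubernetes.azure.com/agentpool=$pool" \
    -o 'custom-columns=NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

This form uses only `kubectl`, so it also works on machines where `jq` isn't installed.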

Add tolerations to your Anyscale compute config

Add a `tolerations` block to your Anyscale compute config under `advanced_instance_config`. Set at the cluster level, the block applies to all pods. Set on a worker group, it applies only to that group's pods. Worker-group tolerations fully replace any cluster-level tolerations for that group, so repeat any cluster-level entries the group still needs.

For example, to tolerate the three taints above on a GPU worker group, add:

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: node.anyscale.com/capacity-type
    operator: Exists
    effect: NoSchedule
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule

For the full compute-config syntax including how to apply settings at the cluster level versus the worker-group level, see Compute configuration options for Kubernetes.
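After you relaunch the cluster, it can help to confirm that the tolerations actually reached the pod spec. The following sketch prints one toleration per line as key, operator, value, and effect. The function name `pod_tolerations` and the `default` namespace fallback are illustrative.

```shell
# Hypothetical helper: print the tolerations set on a pod, one per line.
# Usage: pod_tolerations <pod-name> [namespace]
pod_tolerations() {
  pod="${1:?usage: pod_tolerations <pod-name> [namespace]}"
  ns="${2:-default}"   # assumption: fall back to the default namespace
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.operator}{"\t"}{.value}{"\t"}{.effect}{"\n"}{end}'
}
```

Kubernetes usually adds its own tolerations for `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable`, so expect those alongside the ones from your compute config. If a key from the taint table is absent, the compute config change didn't apply to that pod.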

Confirm the NVIDIA device plugin is running

Even with the right tolerations, GPU pods stay `Pending` if the NVIDIA device plugin daemonset isn't running and Ready on your GPU nodes. The device plugin exposes `nvidia.com/gpu` as a schedulable resource. Without it, the scheduler can't satisfy a GPU resource request.

Check the device plugin daemonset:

kubectl get daemonset -n kube-system -l name=nvidia-device-plugin-ds

Confirm that `DESIRED` and `READY` both match the number of GPU nodes in your cluster. If the daemonset is missing or its pods aren't ready, install or reinstall it per Use GPUs on AKS.
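You can also check the result of the device plugin directly by looking at what each node advertises as allocatable. This is a minimal sketch and the function name `gpu_allocatable` is hypothetical.

```shell
# Hypothetical helper: show how many nvidia.com/gpu resources each node advertises.
# The backslash escapes the dot inside the resource name for kubectl's column syntax.
gpu_allocatable() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

A GPU node that shows `<none>` in the `GPU` column isn't advertising the resource, which points to the device plugin rather than to tolerations.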

Related Azure docs

Compute configuration options for Kubernetes shows the `advanced_instance_config` block where tolerations, node selectors, and labels go.
Quickstart Step 1b: create the AKS cluster covers GPU quota requirements and node pool guidance.
Use GPUs on AKS is the canonical Microsoft Learn guide for adding GPU node pools and installing the NVIDIA device plugin.
