---
title: "Troubleshoot GPU pods stuck in Pending"
description: "Fix GPU pods stuck in Pending on Anyscale on Azure by adding the right tolerations and confirming the NVIDIA device plugin is ready."
---

# Troubleshoot GPU pods stuck in Pending

When a GPU worker pod stays in `Pending` and `kubectl describe pod <pod>` shows an `untolerated taint` event, the pod's tolerations don't match the taints on your AKS GPU node pool. Anyscale on Azure applies several taints to GPU and capacity-type-specific node pools to keep general workloads off them. Your Ray cluster pod specs must tolerate those taints.

## Taints applied by Anyscale on Azure node pools

Node pools created by the Anyscale on Azure quickstart carry the following taints. A pod schedules onto a node pool only if its spec tolerates every taint present on that pool.

| Taint key | Where it appears |
| --- | --- |
| `nvidia.com/gpu` | Any node pool backed by a GPU-capable VM SKU. |
| `node.anyscale.com/capacity-type` | Node pools that Anyscale manages by capacity type, such as on-demand or spot. |
| `kubernetes.azure.com/scalesetpriority` | Spot-priority node pools managed by AKS. |

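The following command lists the taints on every node in the cluster. To limit the output to a single node pool, add a label selector such as `-l kubernetes.azure.com/agentpool=<node-pool-name>`: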

```bash
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
```
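When a cluster has many nodes, grouping the same JSON by node pool can make the taint sets easier to compare. The following is a minimal Python sketch of that grouping, not an Anyscale tool. It assumes nodes carry the AKS `kubernetes.azure.com/agentpool` label, and the function name `taints_by_pool` and the file name `nodes.json` are hypothetical.

```python
import json
import sys
from collections import defaultdict

# Assumption: AKS labels each node with the name of its node pool under this key.
POOL_LABEL = "kubernetes.azure.com/agentpool"


def taints_by_pool(node_list):
    """Map each node pool name to the sorted, de-duplicated taints on its nodes.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    pools = defaultdict(set)
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        pool = labels.get(POOL_LABEL, "<no pool label>")
        pools[pool]  # register the pool even if it has no taints
        # `.spec.taints` is absent on untainted nodes, so fall back to [].
        for taint in node.get("spec", {}).get("taints") or []:
            pools[pool].add(
                (taint.get("key", ""), taint.get("value", ""), taint.get("effect", ""))
            )
    return {pool: sorted(taints) for pool, taints in pools.items()}


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: kubectl get nodes -o json > nodes.json
    #        python taints_by_pool.py nodes.json
    with open(sys.argv[1]) as handle:
        for pool, taints in taints_by_pool(json.load(handle)).items():
            print(pool)
            for key, value, effect in taints:
                print(f"  {key}={value}:{effect}")
```

Each pool's output lists the taints that a pod targeting that pool must tolerate.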

## Add tolerations to your Anyscale compute config

Add a `tolerations` block to your Anyscale compute config under `advanced_instance_config`. Set at the cluster level, the block applies to every pod in the cluster. Set on a worker group, it applies only to that group's pods and fully replaces the cluster-level tolerations for that group, so repeat any cluster-level tolerations the group still needs.

For example, to tolerate the three taints above on a GPU worker group, add:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: node.anyscale.com/capacity-type
    operator: Exists
    effect: NoSchedule
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
```

For the full compute-config syntax, including how to apply settings at the cluster level versus the worker-group level, see [Compute configuration options for Kubernetes](/configuration/compute/kubernetes.md).
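To reason about an `untolerated taint` event by hand, it can help to model two rules together: worker-group tolerations replace cluster-level ones, and a toleration matches a taint under the standard Kubernetes comparison of key, operator, value, and effect. The Python sketch below is a simplified model of both rules, not Anyscale or Kubernetes code. The function names are hypothetical, the taint values `present` and `SPOT` are placeholders rather than documented values, and treating `None` as "no worker-group block" is an assumption.

```python
def effective_tolerations(cluster_level, worker_group_level=None):
    """Return the tolerations a worker group ends up with.

    Worker-group tolerations replace the cluster-level list; they are not merged.
    Assumption: None means the worker group sets no `tolerations` block.
    """
    if worker_group_level is not None:
        return list(worker_group_level)
    return list(cluster_level)


def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An unset effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An unset key (valid only with Exists) matches every taint key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    # Kubernetes treats a missing operator as Equal.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints, tolerations):
    """Return the taints that would keep a pod off the node."""
    # PreferNoSchedule is a soft preference and never blocks scheduling.
    blocking = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking if not any(tolerates(tol, t) for tol in tolerations)]


# Illustrative taints for a spot GPU node. The keys come from the taint table;
# the values "present" and "SPOT" are placeholders, not documented values.
EXAMPLE_TAINTS = [
    {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"},
    {"key": "node.anyscale.com/capacity-type", "value": "SPOT", "effect": "NoSchedule"},
    {"key": "kubernetes.azure.com/scalesetpriority", "value": "spot", "effect": "NoSchedule"},
]

# The three tolerations from the YAML example in this section.
EXAMPLE_TOLERATIONS = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "node.anyscale.com/capacity-type", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "kubernetes.azure.com/scalesetpriority", "operator": "Equal",
     "value": "spot", "effect": "NoSchedule"},
]

if __name__ == "__main__":
    # Cluster-level tolerations only: every taint is tolerated.
    print(untolerated_taints(EXAMPLE_TAINTS, effective_tolerations(EXAMPLE_TOLERATIONS)))
    # A worker group that sets only the spot toleration loses the other two.
    spot_only = effective_tolerations(EXAMPLE_TOLERATIONS, EXAMPLE_TOLERATIONS[2:])
    for taint in untolerated_taints(EXAMPLE_TAINTS, spot_only):
        print("untolerated:", taint["key"])
```

In the second case the sketch reports `nvidia.com/gpu` and `node.anyscale.com/capacity-type` as untolerated, which is the situation to look for when a worker group defines its own `tolerations` block and its pods stay `Pending`.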

## Confirm the NVIDIA device plugin is running

Even with the right tolerations, GPU pods stay `Pending` unless the NVIDIA device plugin daemonset is running and `Ready` on your GPU nodes. The device plugin advertises `nvidia.com/gpu` as a schedulable resource on each node. Without it, no node reports GPU capacity, so the scheduler can't satisfy a GPU resource request.

Check the device plugin daemonset:

```bash
kubectl get daemonset -n kube-system -l name=nvidia-device-plugin-ds
```

Confirm that the `DESIRED` and `READY` columns both equal the number of GPU nodes in your cluster. If the daemonset is missing or its pods aren't ready, install or reinstall it by following [Use GPUs on AKS](https://learn.microsoft.com/azure/aks/gpu-cluster).
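The same comparison can be scripted against saved `kubectl ... -o json` output. The Python sketch below is one possible check under stated assumptions, not an official diagnostic: it treats any node tainted with `nvidia.com/gpu` as a GPU node, expects a single DaemonSet object rather than a list, and uses the hypothetical names `check_device_plugin`, `daemonset.json`, and `nodes.json`.

```python
import json
import sys

GPU_KEY = "nvidia.com/gpu"


def is_gpu_node(node):
    """Assumption: GPU nodes are the ones carrying the nvidia.com/gpu taint."""
    taints = node.get("spec", {}).get("taints") or []
    return any(taint.get("key") == GPU_KEY for taint in taints)


def check_device_plugin(daemonset, node_list):
    """Return a list of human-readable problems; an empty list means healthy.

    `daemonset` is one DaemonSet object and `node_list` is the parsed output
    of `kubectl get nodes -o json`.
    """
    problems = []
    gpu_nodes = [n for n in node_list.get("items", []) if is_gpu_node(n)]
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)  # DESIRED column
    ready = status.get("numberReady", 0)  # READY column
    if desired != len(gpu_nodes):
        problems.append(f"DESIRED is {desired} but found {len(gpu_nodes)} GPU nodes")
    if ready != desired:
        problems.append(f"READY is {ready} but DESIRED is {desired}")
    for node in gpu_nodes:
        # Resource quantities are strings in the API; a missing key means the
        # device plugin has not advertised any GPUs on this node.
        allocatable = node.get("status", {}).get("allocatable", {})
        if int(allocatable.get(GPU_KEY, "0")) == 0:
            problems.append(f"{node['metadata']['name']} advertises no {GPU_KEY}")
    return problems


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_device_plugin.py daemonset.json nodes.json
    with open(sys.argv[1]) as ds_file, open(sys.argv[2]) as nodes_file:
        found = check_device_plugin(json.load(ds_file), json.load(nodes_file))
    print("\n".join(found) if found else "device plugin looks healthy")
```

A node reported as advertising no `nvidia.com/gpu` is the most direct sign that the device plugin pod on that node isn't working, even when the daemonset counts look correct.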

## Related Azure docs

-   [Compute configuration options for Kubernetes](/configuration/compute/kubernetes.md) shows the `advanced_instance_config` block where tolerations, node selectors, and labels go.
-   [Quickstart Step 1b: create the AKS cluster](https://learn.microsoft.com/azure/anyscale-on-azure/quickstart-azure-cli-gateway-envoy#1b-create-the-aks-cluster) covers GPU quota requirements and node pool guidance.
-   [Use GPUs on AKS](https://learn.microsoft.com/azure/aks/gpu-cluster) is the canonical Microsoft Learn guide for adding GPU node pools and installing the NVIDIA device plugin.

---

Previous: [Troubleshoot Anyscale on Azure cloud creation](/kb/azure/troubleshoot-cloud-creation.md) | Next: [Troubleshoot storage account connectivity](/kb/azure/troubleshoot-storage-connectivity.md)