Troubleshoot GPU visibility in pods

When configuring multi-GPU nodes to run multiple single-GPU pods, each pod may be able to see all GPUs on the node instead of only the GPU it requested.

Symptoms

A pod configured to request one GPU can see all GPUs on the node when querying with pynvml or similar tools. The following sections show an example pod configuration, the test code that surfaces the symptom, and the unexpected output.

Example configuration

The following instance type is configured as a 1-GPU slice of an 8-GPU node:

a100-1g-spot:  # 1 GPU pod shape
  resources:
    CPU: 11
    GPU: 1
    memory: 163Gi
    "accelerator_type:A100-80G": 1
  nodeSelector:
    cloud.google.com/gke-nodepool: <gpu-node-pool>
  tolerations:
    - key: "workload.anyscale.com/type"
      value: "<workload-type>"
      effect: "NoSchedule"
    - key: "nvidia.com/gpu"
      value: "present"
      effect: "NoSchedule"

Example test code

import subprocess
import pynvml
import ray

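# Request exactly one GPU from Ray for this task.
@ray.remote(num_gpus=1)
def gpu_function():
    # nvidia-smi and NVML list every GPU exposed to the container.
    # Neither honors CUDA_VISIBLE_DEVICES, so both reflect container-level visibility.
    subprocess.run(["nvidia-smi"], check=True)
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        print("Num GPUs:", count)
        print("GPU Names:", [
            pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(i))
            for i in range(count)
        ])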
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    ray.get(gpu_function.remote())

Unexpected output

(gpu_function pid=<pid>, ip=<node-ip>) Num GPUs: 8
(gpu_function pid=<pid>, ip=<node-ip>) GPU Names: ['NVIDIA A100-SXM4-80GB', 'NVIDIA A100-SXM4-80GB', ...]

The pod sees all 8 GPUs on the node instead of only the 1 GPU it requested.

Common causes

Three causes account for most cases of this behavior: a container image that overrides GPU visibility, a pod spec that doesn't request a GPU resource, and a misconfigured device plugin or container runtime.

Image sets NVIDIA_VISIBLE_DEVICES to all

If your container image sets the environment variable NVIDIA_VISIBLE_DEVICES=all, the pod can see all GPUs on the node.

Check your Dockerfile and remove or modify the NVIDIA_VISIBLE_DEVICES setting:

# Remove this line or don't set it. Let the Kubernetes device plugin handle GPU assignment:
# ENV NVIDIA_VISIBLE_DEVICES all

See the NVIDIA Container Toolkit documentation for details.
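To confirm what the running container actually received, you can inspect the variable from inside the pod. The following is a minimal sketch, not part of any NVIDIA tooling: the helper name describe_visible_devices is hypothetical, and it assumes the documented values of NVIDIA_VISIBLE_DEVICES (all, none, void, empty, or a comma-separated list of GPU indexes or UUIDs).

```python
import os


def describe_visible_devices(value):
    """Summarize an NVIDIA_VISIBLE_DEVICES value (None means the variable is unset)."""
    if value is None or value.strip() in ("", "void"):
        # Unset, empty, or "void": the variable selects no GPUs by itself.
        return "not set: no GPU selection from this variable"
    value = value.strip()
    if value == "all":
        return "all: every GPU on the node is exposed"
    if value == "none":
        return "none: no GPUs are exposed"
    # Anything else is treated as a comma-separated list of GPU indexes or UUIDs.
    devices = [d.strip() for d in value.split(",") if d.strip()]
    return f"{len(devices)} device(s): {', '.join(devices)}"


# Report the value seen by the current process.
print(describe_visible_devices(os.environ.get("NVIDIA_VISIBLE_DEVICES")))
```

If this reports "all" in a pod that requested a single GPU, the image or pod environment is the likely source of the setting.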

Pod didn't request or limit GPU

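If the pod specification doesn't request a GPU through the nvidia.com/gpu resource, the NVIDIA device plugin doesn't allocate a specific GPU to the container and won't inject the environment variables needed to restrict GPU visibility.

Ensure your pod specification includes GPU resources. Kubernetes requires extended resources such as nvidia.com/gpu to appear under limits. If you also set requests, the value must equal the limit: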

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

See the NVIDIA device plugin documentation for details.
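One way to audit an existing pod is to export it with kubectl get pod <pod-name> -o json and look for containers that lack a GPU limit. The sketch below assumes the standard pod JSON layout (spec.containers[].resources.limits). The function name containers_without_gpu_limit and the sample manifest are hypothetical and only illustrate the check.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def containers_without_gpu_limit(pod):
    """Return names of containers whose limits lack a positive nvidia.com/gpu value."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        try:
            # kubectl emits quantities as strings, for example "1".
            count = int(limits.get(GPU_RESOURCE, 0))
        except (TypeError, ValueError):
            count = 0
        if count < 1:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest for illustration; load real data with json.load() instead.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "ray-worker", "resources": {"limits": {"nvidia.com/gpu": "1"}}},
            {"name": "metrics-sidecar", "resources": {}},
        ]
    }
}

print(containers_without_gpu_limit(sample_pod))
```

A flagged container isn't necessarily a problem. It only matters when that container is expected to use, or be restricted to, a specific GPU.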

Device plugin or runtime misconfigured

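The NVIDIA device plugin might not be running, or the plugin or container runtime might be configured in a way that doesn't isolate GPUs per pod.

Verify the NVIDIA device plugin is running. The namespace and label selector depend on how the plugin was installed, so adjust them if the following command returns no pods: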

kubectl get pods -n kube-system -l app=nvidia-device-plugin

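Check the device plugin configuration and confirm that its strategy settings, such as the device list strategy that controls how allocated GPUs are passed to the container, match your container runtime and use case.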
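When many nodes are involved, scanning the kubectl output by eye is error-prone. As a rough sketch, the same query with -o json appended can be filtered for pods that aren't running and ready. The function name unready_pods and the sample pod names below are hypothetical. The fields used (status.phase and status.containerStatuses[].ready) are part of the standard pod status.

```python
def unready_pods(pod_list):
    """Return (name, phase) for pods that aren't Running with all containers ready."""
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        statuses = status.get("containerStatuses", [])
        # A pod with no container statuses yet is treated as not ready.
        all_ready = bool(statuses) and all(s.get("ready", False) for s in statuses)
        if phase != "Running" or not all_ready:
            problems.append((pod["metadata"]["name"], phase))
    return problems


# Hypothetical data shaped like `kubectl get pods ... -o json` output.
sample_list = {
    "items": [
        {
            "metadata": {"name": "nvidia-device-plugin-abcde"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
        },
        {
            "metadata": {"name": "nvidia-device-plugin-fghij"},
            "status": {"phase": "Pending", "containerStatuses": []},
        },
    ]
}

print(unready_pods(sample_list))
```

An empty items list means the selector matched no pods, which usually points to a different namespace or label rather than a healthy plugin.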

Verification

After applying the fixes, verify that the pod sees only the number of GPUs it requested:

import pynvml

pynvml.nvmlInit()
try:
    num_gpus = pynvml.nvmlDeviceGetCount()
    print(f"Visible GPUs: {num_gpus}")
    # Should print "Visible GPUs: 1"
finally:
    pynvml.nvmlShutdown()
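If pynvml isn't installed in the image, the output of nvidia-smi -L gives the same information. The following sketch parses that output. It assumes the usual "GPU <index>: <name> (UUID: <uuid>)" line format, the helper name parse_gpu_list is hypothetical, and the UUID in the sample is a placeholder.

```python
import re

# Matches lines such as "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-...)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Parse `nvidia-smi -L` output into a list of GPU records."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


# Placeholder output for illustration. In a pod, capture the real output with
# subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True).stdout
sample_output = "GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"

gpus = parse_gpu_list(sample_output)
print(f"Visible GPUs: {len(gpus)}")
```

Lines that don't start with "GPU", such as MIG device entries, are skipped by this parser.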

Out of scope

This article covers GPU visibility for pods that request a subset of GPUs on a multi-GPU node. RDMA setups use /dev/infiniband* devices and can produce related symptoms such as wrong GPU counts in the pod. The diagnostics and fixes for RDMA visibility differ from those documented here and aren't covered.
