Troubleshoot GPU visibility in pods
On multi-GPU nodes configured to run multiple single-GPU pods, each pod may see all GPUs on the node instead of only the GPU it requested.
Symptoms
A pod configured to request one GPU can see all GPUs on the node when querying via pynvml or similar tools. The following sections show an example pod configuration, the test code that surfaces the symptom, and the unexpected output.
Example configuration
Instance type configured as a 1 GPU slice of an 8 GPU node:
a100-1g-spot: # 1 GPU pod shape
  resources:
    CPU: 11
    GPU: 1
    memory: 163Gi
    "accelerator_type:A100-80G": 1
  nodeSelector:
    cloud.google.com/gke-nodepool: <gpu-node-pool>
  tolerations:
    - key: "workload.anyscale.com/type"
      value: "<workload-type>"
      effect: "NoSchedule"
    - key: "nvidia.com/gpu"
      value: "present"
      effect: "NoSchedule"
Example test code
import subprocess

import pynvml
import ray


@ray.remote(num_gpus=1)
def gpu_function():
    # Print the driver's view of the GPUs, then query NVML directly.
    subprocess.run(["nvidia-smi"], check=True)
    pynvml.nvmlInit()
    try:
        print("Num GPUs:", pynvml.nvmlDeviceGetCount())
        print("GPU Names:", [
            pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(i))
            for i in range(pynvml.nvmlDeviceGetCount())
        ])
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    ray.get(gpu_function.remote())
Unexpected output
(gpu_function pid=<pid>, ip=<node-ip>) Num GPUs: 8
(gpu_function pid=<pid>, ip=<node-ip>) GPU Names: ['NVIDIA A100-SXM4-80GB', 'NVIDIA A100-SXM4-80GB', ...]
The pod sees all 8 GPUs on the node instead of only the 1 GPU it requested.
Common causes
Three causes account for most cases of this behavior: a container image that overrides GPU visibility, a pod spec that doesn't request a GPU resource, and a misconfigured device plugin or container runtime.
Image sets NVIDIA_VISIBLE_DEVICES to all
If your container image sets the environment variable NVIDIA_VISIBLE_DEVICES=all, the pod can see all GPUs on the node.
Check your Dockerfile and remove or modify the NVIDIA_VISIBLE_DEVICES setting:
# Remove this line or don't set it. Let the Kubernetes device plugin handle GPU assignment:
# ENV NVIDIA_VISIBLE_DEVICES all
See the NVIDIA Container Toolkit documentation for details.
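To confirm whether this cause applies, it can help to inspect the variable from inside the running container rather than only in the Dockerfile, since a base image may also set it. The following is a minimal sketch: the function name describe_gpu_visibility and its three labels are illustrative, not part of any NVIDIA or Kubernetes API.

```python
import os
from typing import Mapping


def describe_gpu_visibility(env: Mapping[str, str]) -> str:
    """Classify the NVIDIA_VISIBLE_DEVICES value found in `env`.

    Returns one of three illustrative labels:
      "unset"      - the variable is not present
      "all"        - every GPU on the node is requested
      "restricted" - any other value, such as a list of indices or UUIDs
    """
    value = env.get("NVIDIA_VISIBLE_DEVICES")
    if value is None:
        return "unset"
    if value.strip().lower() == "all":
        return "all"
    return "restricted"


if __name__ == "__main__":
    # Run this inside the pod, for example through kubectl exec.
    print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
    print("Interpretation:", describe_gpu_visibility(os.environ))
```

If the script reports "all" in a pod that requested a single GPU, the image or one of its base layers is a likely source of the setting.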
Pod didn't request or limit GPU
If the pod specification doesn't include GPU requests and limits, the NVIDIA device plugin won't inject the environment variables needed to restrict GPU visibility.
Ensure your pod specification includes GPU resources:
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
See the NVIDIA device plugin documentation for details.
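One way to check a running pod is to scan its spec for containers that set no nvidia.com/gpu limit. The sketch below assumes the pod has been loaded into a dictionary, for example from the output of kubectl get pod <pod-name> -o json. The function name containers_missing_gpu_limit and the sample pod are hypothetical.

```python
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"


def containers_missing_gpu_limit(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers in `pod` with no nvidia.com/gpu limit."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        if GPU_RESOURCE not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Hypothetical pod with one GPU container and one container without a GPU limit.
    sample_pod = {
        "spec": {
            "containers": [
                {"name": "worker", "resources": {"limits": {GPU_RESOURCE: "1"}}},
                {"name": "sidecar", "resources": {}},
            ]
        }
    }
    print("Containers without a GPU limit:", containers_missing_gpu_limit(sample_pod))
```

Sidecar containers that don't need a GPU will legitimately appear in the result; the container that runs the GPU workload should not.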
Device plugin or runtime misconfigured
The NVIDIA device plugin or container runtime may not be running or may be using the wrong strategy.
Verify the NVIDIA device plugin is running:
kubectl get pods -n kube-system -l app=nvidia-device-plugin
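Check the device plugin configuration and confirm that its strategy settings match your setup. In the NVIDIA k8s-device-plugin, for example, the device list strategy determines how the plugin passes the allocated devices to the container runtime.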
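A complementary check is whether each GPU node advertises nvidia.com/gpu as an allocatable resource, since a node only reports that resource once a device plugin has registered it. The following sketch parses the output of kubectl get nodes -o json. The function name allocatable_gpus and the sample node names are hypothetical.

```python
import json
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(nodes_json: str) -> Dict[str, int]:
    """Map each node name to its allocatable nvidia.com/gpu count (0 if absent)."""
    nodes = json.loads(nodes_json)
    counts = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


if __name__ == "__main__":
    # Hypothetical sample in the shape of `kubectl get nodes -o json` output.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "gpu-node-a"},
             "status": {"allocatable": {"cpu": "96", GPU_RESOURCE: "8"}}},
            {"metadata": {"name": "gpu-node-b"},
             "status": {"allocatable": {"cpu": "96"}}},
        ]
    })
    for name, count in allocatable_gpus(sample).items():
        print(f"{name}: {count} allocatable GPU(s)")
```

A GPU node that reports 0 here suggests the device plugin is not running or has not registered with the kubelet on that node.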
Verification
After applying fixes, verify that the pod can only see the requested number of GPUs:
import pynvml

pynvml.nvmlInit()
try:
    num_gpus = pynvml.nvmlDeviceGetCount()
    print(f"Visible GPUs: {num_gpus}")
    # Should print "Visible GPUs: 1"
finally:
    pynvml.nvmlShutdown()
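If pynvml isn't installed in the image, a similar check can count the devices listed by nvidia-smi -L. This is a sketch under the assumption that nvidia-smi is on the PATH inside the container. The helper names parse_gpu_list and visible_gpus are illustrative.

```python
import shutil
import subprocess
from typing import List


def parse_gpu_list(listing: str) -> List[str]:
    """Return the per-GPU lines from `nvidia-smi -L` style output.

    Only lines that start with "GPU " are kept, so indented MIG device
    lines are not counted as separate GPUs.
    """
    lines = (line.strip() for line in listing.splitlines())
    return [line for line in lines if line.startswith("GPU ")]


def visible_gpus() -> List[str]:
    """Run `nvidia-smi -L` and return one entry per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], check=True, capture_output=True, text=True
    )
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        print(f"Visible GPUs: {len(visible_gpus())}")
```

As with the pynvml check, a pod that requested one GPU should report a single visible GPU.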
Out of scope
This article covers GPU visibility for pods that request a subset of GPUs on a multi-GPU node. RDMA setups use /dev/infiniband* devices and can produce related symptoms such as wrong GPU counts in the pod. The diagnostics and fixes for RDMA visibility differ from those documented here and aren't covered.