Troubleshoot CUDA errors in training jobs
Training jobs can fail with CUDA errors before training begins because of hardware failures or GPU initialization issues. These failures are less common than errors caused by dependency issues or user code, but they occur often enough in distributed training that recognizing the failure mode helps you diagnose them quickly.
Symptoms
Jobs fail with errors similar to:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
The error typically occurs during GPU initialization:
File "/opt/state/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
The same invalid device ordinal error can surface from manual GPU index selection in user code. This article covers the hardware-initialization variant. If the GPUs you expect are otherwise healthy, check your code for explicit cuda:N device references.
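To rule out the user-code variant, a quick scan for hard-coded device strings can help. The following is a minimal sketch, not an Anyscale or PyTorch utility: the function name `find_out_of_range_devices` and the `visible_gpus` parameter are hypothetical, and the scan only catches literal `cuda:N` strings, not indices computed at runtime.

```python
import re

# Matches literal device strings such as "cuda:3"; a bare "cuda" is not matched.
CUDA_INDEX_PATTERN = re.compile(r"cuda:(\d+)")


def find_out_of_range_devices(source: str, visible_gpus: int) -> list[tuple[int, int]]:
    """Return (line_number, device_index) pairs whose index is >= visible_gpus.

    `visible_gpus` is an assumption supplied by the caller, for example the
    number of GPUs requested per worker.
    """
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in CUDA_INDEX_PATTERN.finditer(line):
            index = int(match.group(1))
            if index >= visible_gpus:
                hits.append((line_number, index))
    return hits


# Illustrative input: the second line refers to a fifth GPU on a 4-GPU node.
sample = 'model.to("cuda:0")\nbuffer = tensor.to("cuda:4")\n'
print(find_out_of_range_devices(sample, visible_gpus=4))  # prints [(2, 4)]
```

If the scan returns nothing and the error persists, the hardware-initialization causes described in this article become more likely.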
Potential causes
Two failure modes commonly produce these errors: hardware failures captured in kernel logs, and GPUs that the driver never detects.
Hardware failure
Check system logs (dmesg) for GPU hardware warnings or errors that occurred before the application ran:
{
"message": "NVRM: Xid (PCI:0000:00:1b): 119, pid='<unknown>', name=<unknown>,
Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 4097
(GSP_INIT_DONE) (0x0 0x0).",
"timestamp": "2025-07-17T22:51:47.727284Z"
}
To access dmesg output, do any of the following:
- In the Anyscale console, with log ingestion enabled, open the Logs tab and select the Kernel component from the component dropdown. See Log ingestion and query.
- If log ingestion isn't enabled, click Download on the Logs tab to get the Anyscale CLI command that downloads the full logs, including dmesg output.
- SSH directly to the node. See Use SSH to access worker nodes.
When searching dmesg, grep for NVRM to surface most NVIDIA kernel messages. Hardware failures usually appear as Xid error codes, so starting there typically leads you to the relevant diagnostic quickly.
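If you have downloaded the logs, the Xid search can be scripted. The sketch below assumes each event follows the `NVRM: Xid (PCI:<address>): <code>` layout shown in the sample message; the helper name `extract_xid_events` is hypothetical, and real kernel logs may contain variations this pattern does not cover.

```python
import re

# Pattern derived from the sample message in this article:
#   NVRM: Xid (PCI:0000:00:1b): 119, pid=...
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def extract_xid_events(log_text: str) -> list[tuple[str, int]]:
    """Return (pci_address, xid_code) for every Xid line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# Illustrative input based on the kernel message shown earlier.
sample = (
    "NVRM: Xid (PCI:0000:00:1b): 119, pid='<unknown>', name=<unknown>, "
    "Timeout after 6s of waiting for RPC response from GPU0 GSP!\n"
    "usb 1-1: new high-speed USB device\n"
)
print(extract_xid_events(sample))  # prints [('0000:00:1b', 119)]
```

Grouping the results by PCI address is one way to see whether a single GPU accounts for the failures on a node.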
Missing GPU detection
If the dcgmi GPU health checks neither report a hardware issue nor mark the node as unhealthy, a low-level failure may still be preventing the driver from detecting a GPU. Validate that the expected number of GPUs is available and healthy before starting the workload. For a quick check, run nvidia-smi and confirm that every GPU you expect appears in the output. A missing entry indicates that the GPU isn't detected at the driver level.
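The count check can be automated as a pre-flight step. This is a rough sketch that assumes `nvidia-smi -L` prints one line per detected GPU beginning with `GPU <index>:`; the function names and the expected count are hypothetical values you would adapt to your instance type.

```python
import re
import subprocess

# `nvidia-smi -L` lists one GPU per line, for example:
#   GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_detected_gpus(listing: str) -> int:
    """Count the GPU entries in the text output of `nvidia-smi -L`."""
    return len(GPU_LINE.findall(listing))


def gpu_shortfall(listing: str, expected: int) -> int:
    """Return how many expected GPUs are absent from the listing (0 if none)."""
    return max(expected - count_detected_gpus(listing), 0)


def read_nvidia_smi_listing() -> str:
    """Run `nvidia-smi -L`; requires the NVIDIA driver utilities on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative listing for a node that should expose 4 GPUs but shows 3.
sample = (
    "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaa)\n"
    "GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-bbbb)\n"
    "GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-cccc)\n"
)
print(gpu_shortfall(sample, expected=4))  # prints 1
```

On a GPU node, pass `read_nvidia_smi_listing()` as the listing and fail fast when the shortfall is nonzero, before the training workers call `torch.cuda.set_device`.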
Investigation steps
When a job fails with these errors, work through the following steps to localize the problem.
Check application logs
Review the full error traceback to identify where the CUDA initialization fails:
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next()
(pid=<pid>, ip=<node-ip>, actor_id=<actor-id>,
repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x73c33025b4c0>)
File "/opt/state/venv/lib/python3.10/site-packages/ray/train/torch/config.py", line 27, in __enter__
torch.cuda.set_device(device)
File "/opt/state/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Check cluster event logs
Look for node status changes or GPU health warnings in the cluster event logs around the time of the failure.
Verify GPU health checks
Anyscale performs GPU health checks using dcgmi. If these checks don't catch the hardware failure, do the following:
- Review the GPU health check results in the cluster logs.
- Ensure health checks can detect low-level hardware failures.