---
title: "Troubleshoot CUDA errors in training jobs"
description: "Troubleshoot CUDA errors that occur before training starts on GPU workloads in Kubernetes."
---

# Troubleshoot CUDA errors in training jobs

Training jobs can fail with CUDA errors before training begins because of hardware failures or GPU initialization issues. These failures are less common than errors caused by dependency issues or user code, but they occur often enough in distributed training that recognizing the failure mode helps you diagnose them quickly.

## Symptoms

Jobs fail with errors similar to:

```text
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
```

The error typically occurs during GPU initialization:

```python
File "/opt/state/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
```

:::note
The same `invalid device ordinal` error can surface from manual GPU index selection in user code. This article covers the hardware-initialization variant. If the GPUs you expect are otherwise healthy, check your code for explicit `cuda:N` device references.
:::
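One way to separate the two variants is to compare the requested device index against the number of devices the runtime reports. The following is a minimal sketch, not part of Anyscale or PyTorch tooling: `classify_device_request` and its return values are hypothetical, and the only PyTorch call it relies on is `torch.cuda.device_count()`.

```python
def classify_device_request(requested: int, visible_count: int) -> str:
    """Classify a requested CUDA device index against the visible device count.

    Hypothetical helper for illustration; the category names are arbitrary.
    """
    if visible_count == 0:
        # Nothing is visible at all: suspect the driver or hardware, not user code.
        return "no-devices-visible"
    if requested < 0 or requested >= visible_count:
        # The index doesn't exist: either user code asks for a GPU the node
        # never had, or fewer GPUs initialized than the node should expose.
        return "index-out-of-range"
    return "ok"


if __name__ == "__main__":
    try:
        import torch  # optional, so the sketch also runs without PyTorch
    except ImportError:
        torch = None

    if torch is not None:
        # Check index 0 against whatever the runtime currently reports.
        print(classify_device_request(0, torch.cuda.device_count()))
```

If the result is `index-out-of-range` and the visible count matches the number of GPUs the node should have, an explicit index in user code is the more likely cause. If the count is lower than expected, continue with the hardware checks in this article.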

## Potential causes

Two failure modes commonly produce these errors: hardware failures captured in kernel logs, and GPUs that the driver never detects.

### Hardware failure

Check system logs (`dmesg`) for GPU hardware warnings or errors that occurred before the application ran:

```json
{
  "message": "NVRM: Xid (PCI:0000:00:1b): 119, pid='<unknown>', name=<unknown>, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).",
  "timestamp": "2025-07-17T22:51:47.727284Z"
}
```

To access `dmesg` output, do any of the following:

1.  In the Anyscale console, with log ingestion enabled, open the **Logs** tab and select the **Kernel** component from the component dropdown. See [Log ingestion and query](/monitoring/accessing-logs.md#log-ingestion-and-query).
2.  If log ingestion isn't enabled, click **Download** on the **Logs** tab to get the Anyscale CLI command that downloads the full logs including `dmesg` output.
3.  SSH directly to the node. See [Use SSH to access worker nodes](/workspaces/debugging.md#use-ssh-to-access-worker-nodes).

When searching `dmesg`, grep for `NVRM` to surface most NVIDIA kernel messages. Hardware failures usually appear as `Xid` error codes, so starting there typically gets you to the relevant diagnostic quickly.
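If you have the kernel log as text, for example after downloading it or connecting to the node over SSH, a short script can pull out the `Xid` codes. This is a sketch under assumptions: the regular expression is modeled on the sample message shown earlier, `find_xid_events` and `read_dmesg` are hypothetical helper names, and reading `dmesg` might require elevated privileges on your nodes.

```python
import re
import subprocess

# Modeled on lines such as: NVRM: Xid (PCI:0000:00:1b): 119, pid=...
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")


def find_xid_events(kernel_log: str) -> list[tuple[str, int]]:
    """Return (pci_id, xid_code) pairs found in kernel log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(kernel_log)]


def read_dmesg() -> str:
    """Return dmesg output, or an empty string if dmesg can't be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    for pci_id, xid_code in find_xid_events(read_dmesg()):
        print(f"Xid {xid_code} on {pci_id}")
```

Keeping the parsing in `find_xid_events` separate from `read_dmesg` lets you run the same function over downloaded log files.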

### Missing GPU detection

If GPU health checks with `dcgmi` don't flag a hardware issue or mark the node as unhealthy, a low-level failure might still be preventing the driver from detecting the GPU. Validate that the expected number of GPUs is available and healthy before starting the workload. For a quick check, run `nvidia-smi` and confirm that every GPU you expect appears in the output. A missing entry indicates that the driver doesn't detect the GPU.
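The count comparison can be scripted as a pre-flight check. The sketch below assumes that `nvidia-smi -L` prints one line per GPU beginning with `GPU <index>:`. `EXPECTED_GPUS`, `count_listed_gpus`, and `list_gpus` are hypothetical names chosen for illustration; set the expected count to match your instance type.

```python
import re
import subprocess

# Hypothetical value: set this to the GPU count of your instance type.
EXPECTED_GPUS = 8

# One top-level line per GPU, for example "GPU 0: <name> (UUID: ...)".
GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_listed_gpus(listing: str) -> int:
    """Count the GPUs in the text output of `nvidia-smi -L`."""
    return len(GPU_LINE.findall(listing))


def list_gpus() -> str:
    """Return `nvidia-smi -L` output, or an empty string if it can't be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    found = count_listed_gpus(list_gpus())
    if found < EXPECTED_GPUS:
        print(f"Only {found} of {EXPECTED_GPUS} expected GPUs detected")
    else:
        print(f"All {EXPECTED_GPUS} expected GPUs detected")
```

Running a check like this at the start of a job surfaces a missing GPU before the training framework fails with a less specific error.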

## Investigation steps

When a job fails with these errors, work through the following steps to localize the problem.

### Check application logs

Review the full error traceback to identify where the CUDA initialization fails:

```python
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next()
(pid=<pid>, ip=<node-ip>, actor_id=<actor-id>,
repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x73c33025b4c0>)
  File "/opt/state/venv/lib/python3.10/site-packages/ray/train/torch/config.py", line 27, in __enter__
    torch.cuda.set_device(device)
  File "/opt/state/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
```
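When the traceback is long, extracting the innermost frame shows quickly whether the failure happened in `set_device` or deeper in user code. The following is a minimal sketch that works on saved log text. `innermost_frame` is a hypothetical helper, and the pattern assumes standard Python traceback formatting like the example above.

```python
import re
from typing import Optional, Tuple

# Standard traceback frame line: File "<path>", line <n>, in <function>
FRAME_PATTERN = re.compile(r'File "([^"]+)", line (\d+), in (\S+)')


def innermost_frame(traceback_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (path, line, function) for the last frame, or None if no frame is found."""
    frames = FRAME_PATTERN.findall(traceback_text)
    if not frames:
        return None
    path, line, function = frames[-1]
    return path, int(line), function


if __name__ == "__main__":
    example = (
        'File "/opt/state/venv/lib/python3.10/site-packages/ray/train/torch/config.py", '
        "line 27, in __enter__\n"
        "    torch.cuda.set_device(device)\n"
        'File "/opt/state/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", '
        "line 404, in set_device\n"
        "    torch._C._cuda_setDevice(device)\n"
    )
    print(innermost_frame(example))
```

A last frame inside `torch/cuda/__init__.py` with the function `set_device` matches the initialization failure that this article describes.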

### Check cluster event logs

Look for node status changes or GPU health warnings in the cluster event logs around the time of the failure.

### Verify GPU health checks

Anyscale performs GPU health checks using `dcgmi`. If these checks don't catch the hardware failure, do the following:

1.  Review the GPU health check results in the cluster logs.
2.  Ensure health checks can detect low-level hardware failures.
