Debugging Ray Data auto-scaling errors
This document is part of a series of debugging tips for the most common Ray + Anyscale errors. It leverages Anyscale's observability suite to identify the root causes of issues. Each error has a real code example to demonstrate the debugging process. These steps are also applicable to other applications encountering similar errors.
This guide specifically focuses on troubleshooting Ray Data auto-scaling errors, typically indicated by the following exception message: ray.exceptions.ActorUnschedulableError: The actor is not schedulable
Example
The job prodjob_idpsahqledl73vynziiz7skbl2 failed on Apr 5, 2025, 7:12:13 AM due to the following error:
File "/tmp/ray/session_2025-04-05_14-09-09_945299_2378/runtime_resources/working_dir_files/s3_anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5_org_7c1Kalm9WcX2bNIjW53GUT_cld_kvedZWag2qA8i5BjxUevf5i7_runtime_env_packages_pkg_c254f33e466bcf57d12a98035307d31c/main.py", line 50, in <module>
ds.materialize()
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/dataset.py", line 5294, in materialize
copy._plan.execute()
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/exceptions.py", line 89, in handle_trace
raise e.with_traceback(None) from SystemException()
ray.exceptions.ActorUnschedulableError: The actor is not schedulable: Tasks or actors with resource shapes [{GPU: 1, CPU: 1}, {GPU: 1, CPU: 1}] failed to schedule because there are not enough resources for the tasks or actors on the whole cluster.
Step 1: Confirm the auto-scaling error
To verify whether the job failure is due to an auto-scaling issue, check the cluster error logs around the failure time. In this example, the following cluster auto-scaling log appears shortly before the error. Note the timestamp of this event: Apr 5, 2025, 7:12:05 AM.
[autoscaler] There are insufficient compute resources to run this workload. Common reasons include (a) not adding or configuring node groups with sufficient resources, (b) requesting a resource (e.g. accelerator) that is not available in the region/provider of your Anyscale cloud, or (c) use of certain Ray scheduling strategies, e.g. STRICT_SPREAD, in a way that doesn't allow resources to fit on a node. Requests:
2 * {"resourcesBundle":{"CPU":1,"GPU":1}};
Step 2: Identify the culprit task
To pinpoint the task causing the issue, use the Ray Core Grafana dashboard available on the Anyscale platform. Run a PromQL query to identify the task that was pending for node assignment near the error timestamp: Apr 5, 2025, 7:12:05 AM.
Update the ClusterId field in the query, which you can find in the Ray Core Grafana dashboard:
sum(ray_tasks{State="PENDING_NODE_ASSIGNMENT",ClusterId=~"<YOUR_CLUSTER_ID>"}) by (Name)
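If you prefer not to go through Grafana, Ray's state API exposes the same information. The sketch below assumes a recent Ray 2.x release and that it runs against the affected cluster while the tasks are still pending; list_tasks and the PENDING_NODE_ASSIGNMENT state come from ray.util.state:

from ray.util.state import list_tasks

# Tasks (including actor creation tasks) that are still waiting for a node.
pending = list_tasks(
    filters=[("state", "=", "PENDING_NODE_ASSIGNMENT")],
    limit=100,
)

for task in pending:
    # task.name is the same label the Grafana query groups by,
    # for example "MapBatches(Sheep).get_location".
    print(task.name)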
In this case, the only task pending node assignment or auto-scaling at the time of the error was MapBatches(Sheep).get_location. Note that while MapBatches and get_location are internal Ray Data tasks and actors, Sheep is a user-defined class, which points to the part of the program the user can change to resolve the issue. Specifically, in this case, the map_batches stages were configured as follows:
ds = (
    ray.data.from_numpy(...)
    .map_batches(
        Rabbit,
        concurrency=1,
    )
    .map_batches(
        Sheep,
        concurrency=2,
        batch_size=4,
        num_gpus=1,  # problematic resource request
    )
    .map_batches(
        Pig,
        concurrency=1,
    )
)
Based on the investigation, you can confidently conclude the following:
- The task MapBatches(Sheep).get_location is the primary cause of the auto-scaling error.
- Each Sheep actor requires 1 CPU and 1 GPU, so with concurrency=2 the stage needs 2 CPUs and 2 GPUs in total. This exceeds the resources available in the cluster.
- To resolve the issue, either relax the stage's resource requirements or increase the resources allocated to the cluster; see the sketch after this list.
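The following is a minimal sketch of the first option: relaxing the resource request so the Sheep stage fits on existing hardware, assuming the cluster can provide at least one GPU. The class bodies and the input data are hypothetical placeholders; only the resource arguments matter, and the right values depend on what Sheep actually does, so treat the numbers as illustrations rather than recommendations:

import numpy as np
import ray

# Hypothetical stand-ins for the user-defined classes in the original job.
class Rabbit:
    def __call__(self, batch):
        return batch

class Sheep:
    def __call__(self, batch):
        return batch

class Pig:
    def __call__(self, batch):
        return batch

ds = (
    ray.data.from_numpy(np.zeros((32, 4)))
    .map_batches(Rabbit, concurrency=1)
    .map_batches(
        Sheep,
        concurrency=2,
        batch_size=4,
        num_gpus=0.5,  # both Sheep actors now share a single GPU
    )
    .map_batches(Pig, concurrency=1)
)
ds.materialize()

Alternatively, keep num_gpus=1 and lower concurrency to 1, or add a node group with at least 2 GPUs to the cluster's compute config so the original request becomes schedulable.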