
Advanced settings for compute configs on Anyscale

This page provides an overview of advanced features for compute configs on Anyscale. These options are common to all Anyscale resource configurations.

Some compute config settings are cloud-specific; see the cloud-specific settings pages for details.

Configure threshold for checking for spot instances

When you configure worker nodes to prefer spot instances but fall back to on-demand virtual machines, you can customize how quickly Anyscale checks for spot instance availability after falling back to on-demand instances. By default, Anyscale sets this value to 60 minutes. Set a lower value to instruct Anyscale to preempt on-demand worker nodes in your cluster when spot instances become available.

Use the replacement_threshold setting to override this default behavior. You can configure this setting for your entire cluster or for individual worker node groups. Valid time units are s (seconds), m (minutes), and h (hours).

To set a cluster-wide replacement threshold from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster. The following example sets the threshold to 15 minutes for all nodes in the cluster:

{
  "replacement_threshold": "15m"
}

To override the cluster-wide threshold for a specific worker group, use the Advanced features section within the Advanced config for that worker node group. The same JSON syntax applies.

You also use this setting to configure node replacement groups. See Configure node replacement.
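If you manage compute configs as YAML rather than through the console, the same setting typically lives under a flags key at the cluster level or on an individual worker group. The placement below is an assumption based on the console JSON above, and the instance types and group name are illustrative:

```yaml
# Hedged sketch: cluster-wide replacement threshold with a per-group override.
# The flags placement is assumed; the console JSON above is authoritative.
head_node:
  instance_type: m5.2xlarge
worker_nodes:
  - name: spot-workers
    instance_type: m5.4xlarge
    market_type: PREFER_SPOT
    flags:
      replacement_threshold: "5m"   # assumed per-group override
flags:
  replacement_threshold: "15m"      # assumed cluster-wide setting
```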

Adjustable downscaling

The Anyscale platform automatically downscales worker nodes that have been idle for a given period. By default, the timeout period ranges from 30 seconds to 4 minutes and is dynamically adjusted for each node group based on the workload. For example, short, bursty workloads have shorter timeouts and more aggressive downscaling. Adjust this timeout value at the cluster level based on your workload needs.

To adjust the timeout value from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster. This example sets the timeout to 60 seconds for all nodes in the cluster.

{
  "idle_termination_seconds": 60
}

Cross-zone scaling

Cross-zone scaling launches your Ray cluster across multiple availability zones. By default, all worker nodes launch in the same availability zone. With cross-zone scaling enabled, Anyscale first attempts to launch worker nodes in existing zones. If that fails, Anyscale tries the next-best zone based on availability.

Use this feature if:

  • You want to maximize the chances of provisioning desired instance types.
  • You want to spread Serve app replicas across multiple zones for better resilience and availability.
  • Your workloads have no heavy inter-node communication or the incurred inter-availability zone cost is acceptable.

To enable or disable this feature from the Anyscale console, use the "Enable cross-zone scaling" checkbox under the Advanced settings for the cluster.

Resource limits

Cluster-wide resource limits define minimum and maximum values for any resource across all nodes in the cluster. Common use cases for this feature include the following:

  1. Specifying the maximum number of GPUs to avoid unintentionally launching a large number of expensive instances.
  2. Specifying a custom resource for specific worker nodes and using that custom resource value to limit the number of nodes of those types.

To set the maximum number of CPUs and GPUs in a cluster from the Anyscale console, use the "Maximum CPUs" and "Maximum GPUs" fields under the Advanced settings for the cluster.

To set other resource limits, use the Advanced features tab under the Advanced settings for the cluster. To add a custom resource to a node group, use the Ray config tab under the Advanced config section for that node group.

This example limits the minimum resources to 1 GPU and 1 <custom-resource> and limits the maximum resources to 5 <custom-resource>.

{
  "min_resources": {
    "GPU": 1,
    "<custom-resource>": 1
  },
  "max_resources": {
    "<custom-resource>": 5
  }
}
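To make the custom-resource pattern concrete: if each node in a worker group advertises one unit of a custom resource through that group's Ray config, then a cluster-wide max_resources limit on that resource caps the number of nodes in the group. The resource name scarce-node below is hypothetical. Add this to the worker group's Ray config:

```json
{
  "scarce-node": 1
}
```

With this in place, a cluster-wide limit of "max_resources": {"scarce-node": 5} restricts that group to at most five nodes.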

Workload starting and recovering timeouts

The workload starting timeout configures how long a workload should attempt to acquire the minimum resources when it first starts. If the timeout expires, Anyscale terminates the workload.

After a workload is running, it may enter the RECOVERING state if it's attempting to recover the minimum resources. This can happen for several reasons, such as spot preemption. The workload recovering timeout configures how long a workload may remain in the RECOVERING state. This avoids the cost of idling existing nodes.

By default, Anyscale sets both timeouts to 25 minutes.

info

These timeouts only apply to jobs and workspaces, not services.

To configure the workload starting and recovering timeouts from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster. This example increases the workload starting timeout to 1 hour and decreases the workload recovering timeout to 10 minutes.

Valid time units are: s, m, and h. For example, 1h30m.

{
  "workload_starting_timeout": "1h",
  "workload_recovering_timeout": "10m"
}

Zonal startup timeout

Anyscale attempts to pack cluster and worker group nodes into the same zone to improve cluster communication performance and minimize cross-zone data transfer.

However, machines may not always be available because of capacity constraints, IP address exhaustion, or other reasons. If Anyscale can't acquire the requested minimum resources, it terminates any existing nodes and sequentially retries the request in a different zone within the cloud deployment.

By default, Anyscale sets the zonal startup timeout to 10 minutes.

To configure the workload zonal startup timeout from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster. This example decreases the workload zonal startup timeout to 5 minutes.

Valid time units are: s, m, and h. For example, 1h30m.

{
  "zone_starting_timeout": "5m"
}

Configure instance ranking and replacement

You can configure custom worker group ranking, selection strategy, and replacement behavior.

Anyscale uses the following defaults when adding nodes to your Ray cluster:

  • Use the smallest node feasible for the workload.
  • Don't add GPU worker nodes for CPU-only workloads.
  • Prioritize CPU-only worker groups over GPU worker groups.
  • Prioritize spot instances over on-demand.
  • Prioritize available instance types.

The following sections describe customizing this behavior.

Instance selection strategy

Set the instance_selection_strategy parameter to relaxed to override the default behavior of preferring the smallest feasible node.

To configure the instance ranking strategy from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster.

{
  "instance_selection_strategy": "relaxed"
}

Valid values are force_smallest_fit (the default) and relaxed.

Price-based ranking strategy

You can use instance prices to determine ranking for worker groups.

important

This feature is in beta release and only available for Anyscale clouds on AWS using virtual machines.

When you enable price-based ranking, Anyscale makes ranking decisions for worker selection based on pricing details provided by AWS. You can optionally add pricing weights to instance types in your worker group configurations. Anyscale applies pricing weights by dividing the cost by the pricing weight, meaning that higher pricing weights increase the priority for an instance type.
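The weighting rule above can be sketched as follows. The instance names and prices here are illustrative, not real AWS quotes; only the divide-by-weight rule comes from the description above.

```python
# Sketch of price-based ranking with pricing weights (illustrative numbers).
# Anyscale divides an instance's price by its pricing weight, so a higher
# weight lowers the effective price and raises the instance's priority.

def effective_price(price_per_hour, pricing_weight=1.0):
    """Effective price used for ranking: lower sorts first."""
    return price_per_hour / pricing_weight

# Hypothetical candidates: (name, hourly price, pricing_weight).
candidates = [
    ("gpu-large", 4.00, 1.0),
    ("gpu-large-weighted", 4.00, 2.0),  # weight 2 halves its effective price
    ("gpu-small", 1.50, 1.0),
]

ranked = sorted(candidates, key=lambda c: effective_price(c[1], c[2]))
print([name for name, _, _ in ranked])
# gpu-small (1.50) sorts ahead of gpu-large-weighted (2.00) and gpu-large (4.00)
```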

note

Anyscale doesn't refresh prices from AWS in real time. Spot prices refresh asynchronously every three hours, while on-demand prices refresh at least monthly.

Price-based ranking treats capacity reservations and machine pool instances as having no price, meaning these instances have the highest priority. Price-based ranking uses the relaxed instance selection strategy, meaning that Anyscale might deploy larger instance types if they have a lower price.

You enable price-based ranking for the entire cluster using the following syntax:

In the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster.

{
  "instance_ranking_strategy": [
    {"ranker_type": "price"}
  ]
}

You enable pricing weights for instance types using the following syntax:

In the Anyscale console, use the Advanced config > Flags section for each worker node.

{"pricing_weight": 2}

important

You can combine price-based ranking with worker group ranking. Anyscale applies ranking strategies in the order you specify. Earlier rankers take precedence, and later rankers break ties.

For example, to prefer worker group order but break ties by price, list custom_group_order first, then price as separate items in instance_ranking_strategy:

flags:
  instance_ranking_strategy:
    - ranker_type: custom_group_order
      ranker_config:
        group_order:
          - worker-group-1
          - worker-group-2
    - ranker_type: price

In this configuration, the custom_group_order ranker type specifies preference across groups, treating all instances within a group as tied. The price ranker then breaks ties within each group by selecting cheaper instances first.
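The "earlier rankers dominate, later rankers break ties" behavior is equivalent to a lexicographic sort, which the following sketch illustrates. The group names and prices are hypothetical; this models the semantics, not Anyscale's internal implementation.

```python
# Sketch of ordered rankers: sorting by a (group_rank, price) tuple
# makes group order dominate and price break ties within a group.

group_order = ["worker-group-1", "worker-group-2"]  # from custom_group_order
group_rank = {name: i for i, name in enumerate(group_order)}

# Hypothetical candidate instances: (group, hourly price).
instances = [
    ("worker-group-2", 1.00),
    ("worker-group-1", 3.00),
    ("worker-group-1", 2.00),
]

ranked = sorted(instances, key=lambda inst: (group_rank[inst[0]], inst[1]))
print(ranked)
# worker-group-1 instances rank first (cheaper one leading), even though
# the worker-group-2 instance is the cheapest overall.
```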

Worker group ranking

Use worker group ranking to set custom prioritization rules for worker groups. You can specify fallback ordering so that Anyscale uses less-preferred instances for fallback or large scaling events while still using your preferred instance types when possible. For example, configure a workload to prefer reserved capacity instances, then spot instances, then on-demand instances.

Specify rules in order of ranking preference, with higher-ranked worker groups listed first. Node replacement also respects worker group ranking and automatically replaces lower-ranked nodes with higher-ranked nodes when they become available. See Configure node replacement.

The following example demonstrates configuring group ranking for three worker groups named spot-worker-1, spot-worker-2, and on-demand-worker. In this example, Anyscale prioritizes the spot groups over the on-demand group, but the two spot groups have equal priority.

Specify worker group ranking in the Advanced features tab under the Advanced settings:

{
  "instance_ranking_strategy": [
    {
      "ranker_type": "custom_group_order",
      "ranker_config": {
        "group_order": [
          ["spot-worker-1", "spot-worker-2"],
          "on-demand-worker"
        ]
      }
    }
  ]
}

Configure node replacement

info

The node replacement feature is in beta release.

Node replacement works alongside custom worker group ranking to automatically replace nodes launched from less-preferred worker groups when resources become available in a higher-ranking worker group. See Worker group ranking.

Anyscale only attempts to launch a replacement node that's at least as large as the existing node. This comparison takes into account CPUs, GPUs, memory, accelerator type, and any user-defined custom resources, which ensures that any workload running on the existing node can run on the replacement node. The replacement threshold is the duration a worker node must run before Anyscale can replace it; configure it to match your workload's checkpointing interval.
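The "at least as large" check can be sketched as a per-resource comparison. The resource names and values below are illustrative; the real comparison also covers accelerator type and custom resources as described above.

```python
# Sketch of the "at least as large" replacement check: the replacement
# must offer >= every resource the existing node has.

def at_least_as_large(replacement, existing):
    """True if the replacement covers every resource of the existing node."""
    return all(replacement.get(res, 0) >= amount
               for res, amount in existing.items())

existing = {"CPU": 8, "GPU": 1, "memory": 64_000_000_000}
candidate_a = {"CPU": 16, "GPU": 1, "memory": 128_000_000_000}
candidate_b = {"CPU": 16, "GPU": 0, "memory": 128_000_000_000}  # loses the GPU

print(at_least_as_large(candidate_a, existing))  # True: can host the workload
print(at_least_as_large(candidate_b, existing))  # False: fewer GPUs
```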

caution

When Anyscale replaces a node, it sends the same interruption signal to Ray tasks and actors as a spot preemption. Anyscale doesn't wait for running tasks to finish before terminating the replaced node. Design your workloads to handle preemption gracefully through checkpointing or fault tolerance. See Configure threshold for checking for spot instances.

One common use case for this feature is prioritizing spot instances over on-demand instances. When you set market_type: PREFER_SPOT on a worker group, Anyscale automatically splits it into two internal groups: {name}/spot and {name}/on-demand. Anyscale then generates a ranking that prefers spot. You don't need to manually configure instance_ranking_strategy for this case. Set replacement_threshold and Anyscale handles the rest.

The following example configures two PREFER_SPOT worker groups with a replacement threshold of 30 minutes. Anyscale launches spot instances first, falls back to on-demand if spot isn't available, and replaces on-demand nodes with spot after 30 minutes.

Set the replacement_threshold using the Advanced features tab under the Advanced settings for the cluster.

Valid time units are s, m, and h. For example, 1h30m.

{
  "replacement_threshold": "30m"
}

Explicit ranking with node replacement

For more complex scenarios, configure instance_ranking_strategy with enable_replacement explicitly. This gives you full control over the group ordering. For example, you can prefer reserved capacity, then spot, then on-demand across multiple worker groups.

Each entry in group_order can be a string for a single group or a list of strings for groups that share a tier. Groups within the same nested list rank equally, and Anyscale replaces nodes from a lower tier only when a higher tier has capacity.

The following example combines a reserved-capacity worker group with two PREFER_SPOT worker groups. Anyscale auto-splits each PREFER_SPOT group into {name}/spot and {name}/on-demand subgroups. The group_order ranks reserved-gpu highest, both spot subgroups equally in the middle tier, and both on-demand subgroups equally in the lowest tier. Anyscale replaces lower-ranked nodes after 30 minutes when higher-ranked capacity becomes available.

Specify the custom group order with enable_replacement set to true and the replacement_threshold, using the Advanced features tab under the Advanced settings for the cluster.

{
  "replacement_threshold": "30m",
  "instance_ranking_strategy": [
    {
      "ranker_type": "custom_group_order",
      "ranker_config": {
        "enable_replacement": true,
        "group_order": [
          "reserved-gpu",
          ["worker-1/spot", "worker-2/spot"],
          ["worker-1/on-demand", "worker-2/on-demand"]
        ]
      }
    }
  ]
}

Configure Ray node resources

Ray uses logical resources on each node to schedule tasks and actors. By default, Ray auto-detects logical resources from the underlying instance, with the following allocations:

| Resource | Description | Default |
| --- | --- | --- |
| CPU | Logical CPUs available for tasks and actors. | Physical CPU count of the instance. |
| GPU | Logical GPUs available for tasks and actors. | Physical GPU count of the instance. |
| memory | Memory available for tasks and actors, in bytes. | 70% of node RAM. |
| object_store_memory | Memory reserved for the Ray object store, in bytes. | 30% of node RAM. |

For background on logical resources, see Specifying node resources in the Ray docs.
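The default 70/30 split can be sketched numerically. The node size here is an example; the percentages come from the table above.

```python
# Sketch of Ray's default memory split on a node: tasks and actors
# get ~70% of RAM, the object store gets ~30%.

def default_memory_split(node_ram_bytes):
    memory = int(node_ram_bytes * 0.70)           # tasks and actors
    object_store = int(node_ram_bytes * 0.30)     # Ray object store
    return memory, object_store

# Example: a 64 GiB node.
ram = 64 * 1024**3
memory, object_store = default_memory_split(ram)
print(memory, object_store)
```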

Override the defaults using the resources field on the head node or any worker group in your compute config. You can also add custom resources to this field for scheduling. Anyscale recommends label-based scheduling instead of custom resources for most use cases. See Use labels to control scheduling.

note

Anyscale doesn't support modifying Ray node resources through ray.init(), ray start, or ray up. Use the compute config instead.

In the Anyscale console, open the Advanced config > Ray config tab on the head node or worker group when creating or versioning a compute config. Use the Custom resources JSON field. The CPU, GPU, memory, and object_store_memory keys override Ray's pre-defined resources. Any other keys define custom resources for scheduling. See Create or version a compute config.

The following example overrides memory and object_store_memory to give tasks and actors more headroom and the object store a larger budget than the defaults:

{
  "memory": 90000000000,
  "object_store_memory": 30000000000
}

note

The resources field accepts numeric values only. Memory unit suffixes such as Gi and Mi aren't supported here. The declarative compute config required_resources field accepts unit suffixes for free pod shapes. See Declarative compute configs.

Object store memory and /dev/shm

On Linux, the Ray object store is backed by /dev/shm, so setting object_store_memory above the available /dev/shm capacity has no effect. On Kubernetes, /dev/shm defaults to a fraction of pod memory. Ask a cloud admin to adjust the pod shape if you need a larger object store. See Configure the Helm chart for the Anyscale operator.
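On a Linux node or workspace, you can inspect the shared-memory capacity yourself. This is a generic Linux check, not an Anyscale-specific command:

```shell
# Show the size and usage of the tmpfs backing the Ray object store.
# If "Size" is smaller than your configured object_store_memory,
# the extra allocation has no effect.
df -h /dev/shm
```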

Disable NFS mounts

Anyscale clouds deployed with NFS use NFS mounts for shared storage locations.

NFS can lead to issues in large jobs or services because of resource limits for concurrent connections enforced by cloud providers.

note

Anyscale clouds deployed after July 14, 2025, use cloud object storage for shared storage locations by default. All cloud deployments can optionally configure NFS.

You can use the CLI command anyscale cloud get to see if your Anyscale cloud has NFS configured.

To disable NFS mounts from the Anyscale console, use the Advanced features tab under the Advanced settings for the cluster.

{
  "disable_nfs_mount": true
}

Control head node scheduling

Anyscale turns off head node scheduling by default for multi-node clusters. This protects the Ray cluster from instability, as the head node contains important system processes such as the global control service, the Ray driver process, and the API server.

Anyscale recommends against scheduling on the head node. Scheduling on the head node can lead to the following under heavy load:

  • The actors and tasks running on the head node contend for resources and interrupt the operation of these system components.
  • The Ray cluster becomes unstable and unable to recover from failures properly.

Anyscale schedules to the head node in the following configurations:

  • You define logical resources in the Advanced config > Ray config for the head node of your Anyscale cluster.
  • You configure a single-node cluster by only defining a head node and no worker nodes.

The following examples show several compute configs and describe the resulting cluster shape and head node scheduling behavior.

Single-node cluster:

head_node:
  instance_type: m5.2xlarge
auto_select_worker_config: false

Single-node clusters only define the head node. In this configuration, all compute runs on the head node.

Multi-node cluster with auto-selected workers:

head_node:
  instance_type: m5.2xlarge
auto_select_worker_config: true

Define a head node and specify auto_select_worker_config to allow Anyscale to automatically select worker nodes for your cluster. All compute runs on worker nodes.

Multi-node cluster with defined workers:

head_node:
  instance_type: m5.2xlarge
worker_nodes:
  - name: gpu-group
    instance_type: p4de.24xlarge

If you manually specify worker nodes, all compute runs on worker nodes.

Multi-node cluster with head scheduling:

head_node:
  instance_type: m5.2xlarge
  resources:
    CPU: 8
    GPU: 0
worker_nodes:
  - name: gpu-group
    instance_type: p4de.24xlarge

In a multi-node cluster, specify CPU resources on the head node to enable scheduling on the head node.

Use labels to control scheduling

You can use labels to control scheduling.

important

This feature is in beta release.

Don't use this feature with node replacement, as node replacement doesn't respect labels when evaluating ranks. This can result in attempting to assign an actor or task to a node in a worker group that doesn't have the correct labels.

Anyscale supports labels for all configured cloud resources. All labels are string values.

The following table describes the default labels set for each node in your cluster:

| Label | Description |
| --- | --- |
| ray.io/node-id | A unique ID generated for the node. |
| ray.io/accelerator-type | The accelerator type of the node, for example L4. CPU-only machines don't have this label. |
| ray.io/market-type | Indicates whether the node uses spot instances or on-demand instances. |
| ray.io/node-group | The name of the worker group, or head for the head node. Invalid characters in worker group names are replaced by underscores (_). For example, the Anyscale default name 1xT4:4CPU-16GB becomes 1xT4_4CPU-16GB. |
| ray.io/region | The cloud region of the node. |
| ray.io/availability-zone | The availability zone of the node. |

You can add custom labels to your nodes using the Anyscale console, CLI, or SDK. You specify labels as JSON-formatted key-value pairs.

In the Anyscale console, you can find the field for defining labels by navigating to Advanced config > Ray config > Labels for the head node or worker node group when creating or versioning a compute config. See Create or version a compute config.

You use labels in your Ray code by passing the label_selector option to @ray.remote(), using the following syntax:

@ray.remote(label_selector={"label_key": "label_value"})
def f():
    pass

For example, use the following to force a function to only schedule to instances with an NVIDIA L4 GPU:

@ray.remote(label_selector={"ray.io/accelerator-type": "L4"})
def f():
    pass

Example: Schedule Ray code to a specific worker group on Anyscale

This example demonstrates how you can use custom labels to directly specify a worker group for scheduling Ray code. Here the pattern controls scheduling Ray code to worker groups with CPU-only machines.

When defining a Ray cluster, add the following custom label to the configuration for each worker group with CPU-only machines:

{
  "cpu-only": "True"
}

In your Ray code, specify the custom label and value using the label_selector option for @ray.remote(), as in the following example:

import ray

@ray.remote(label_selector={"cpu-only": "True"})
def hello_world():
    return "Hello World!"

result = ray.get(hello_world.remote())
print(result)

This pattern ensures that Ray always schedules the hello_world function to nodes in a worker group with the cpu-only label set to True.