Configure multiple resources for an Anyscale cloud

This page provides an overview of adding multiple resource configurations to an Anyscale cloud. With multiple resource configurations, Anyscale jobs can fall back to resources in another region or cloud provider when resources in your primary configuration aren't available for your Anyscale clusters.

important

This feature is in beta release.

You can only configure fallback across multiple resources for Anyscale jobs. Fallback isn't supported for workspaces or services.

Add a cloud resource configuration to an Anyscale cloud

You can add cloud resource configurations to any self-hosted Anyscale cloud. See Cloud CLI.

note

Anyscale serverless clouds (also called Anyscale-hosted clouds) are the only clouds that don't support multiple resource configurations.

You must be an owner on an existing Anyscale cloud to add a new resource configuration. Complete the following steps:

  1. Define your cloud resource configuration in a YAML file on your local machine, such as the following example saved to /path/to/cloud-resources.yaml:

    name: new-cloud-resource-name
    provider: AWS
    compute_stack: VM
    region: us-west-2
    networking_mode: PUBLIC
    object_storage:
      bucket_name: s3://my-bucket
    aws_config:
      vpc_id: vpc-123
      subnet_ids:
        - subnet-123
      security_group_ids:
        - sg-123
      anyscale_iam_role_id: arn:aws:iam::123456789012:role/anyscale-role-123
      cluster_iam_role_id: arn:aws:iam::123456789012:role/cluster-role-123
      memorydb_cluster_name: my-memorydb-cluster

    For cloud resource config parameters, see CloudResource.

  2. Run the following command to use this YAML config file to create a cloud resource configuration in your existing cloud:

    anyscale cloud resource create --cloud <cloud-name> --file /path/to/cloud-resources.yaml

    For Kubernetes resources (compute_stack: K8S), the command outputs a cloud resource ID in the format cldrsrc_xxx and skips verification automatically. Verification can't run until the Anyscale operator is installed in the target cluster, and the operator's Helm chart needs this ID as its cloud deployment ID. After the command returns the ID, install the Anyscale operator with global.cloudDeploymentId set to that value, then run anyscale cloud verify to validate the resource. See Deploy Anyscale on non-managed Kubernetes (cloud register) for the operator install steps.

Update multiple resources for a cloud

You can use the CLI to update the configurations for multiple resources for an Anyscale cloud. You can only modify existing resources this way.

important

The anyscale cloud update command expects a resources YAML file formatted as a list of resources, as in the following example:

- name: k8s-azure-eastus
  provider: AZURE
  compute_stack: K8S
  region: eastus
  object_storage:
    bucket_name: abfss://<container-name>@<storage-account-name>.dfs.core.windows.net
  kubernetes_config:
    zones:
      - eastus-1
      - eastus-2
      - eastus-3
- name: k8s-azure-westus
  provider: AZURE
  compute_stack: K8S
  region: westus
  object_storage:
    bucket_name: abfss://<container-name>@<storage-account-name>.dfs.core.windows.net
  kubernetes_config:
    zones:
      - westus-1
      - westus-2
      - westus-3

Complete the following steps to update resources for a cloud:

  1. Save the current cloud configuration as a YAML file using the following CLI command:

    anyscale cloud get --name <cloud-name> --output /path/to/cloud-resources.yaml

  2. Open the file in your preferred text editor or IDE.

    • You must format your file to match the structure of a cloud resources YAML. Modify the output of anyscale cloud get to remove everything except the items under resources:.
    • Edit settings for one or more of your resources.
    • Save the file.

  3. Run the following CLI command to apply your saved configuration:

    anyscale cloud update --name <cloud-name> --resources-file /path/to/cloud-resources.yaml
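
The trimming in step 2 can also be scripted. The following is a minimal Python sketch, not an Anyscale tool. It assumes the output of anyscale cloud get has already been parsed into a dictionary (for example, with a YAML parser such as PyYAML, which isn't in the standard library) and that the dictionary has a top-level resources key holding a list, as step 2 describes. The function name extract_resources and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def extract_resources(cloud_config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the list stored under the top-level ``resources`` key.

    ``cloud_config`` is assumed to be the parsed output of
    ``anyscale cloud get``. No other key in the output is relied on.
    """
    resources = cloud_config.get("resources")
    if not isinstance(resources, list) or not resources:
        raise ValueError("expected a non-empty list under 'resources'")
    for entry in resources:
        if not isinstance(entry, dict) or "name" not in entry:
            raise ValueError("every resource entry must be a mapping with a 'name'")
    return resources


# Hypothetical parsed output. Keys outside "resources" are placeholders.
example = {
    "name": "my-multi-resource-cloud",
    "resources": [
        {"name": "k8s-azure-eastus", "provider": "AZURE", "region": "eastus"},
        {"name": "k8s-azure-westus", "provider": "AZURE", "region": "westus"},
    ],
}

resource_list = extract_resources(example)
print([resource["name"] for resource in resource_list])
```

Writing resource_list back out as YAML gives a file in the list format that anyscale cloud update expects.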

Remove a cloud resource configuration from an Anyscale cloud

Run the following command to remove a cloud resource configuration from an Anyscale cloud:

anyscale cloud resource delete --cloud <cloud-name> --resource <cloud-resource-name>

Replace <cloud-resource-name> with the name of the cloud resource you want to remove.

Configure compute configs for multi-resource clouds

Once you've added multiple cloud resources to an Anyscale cloud, you configure compute configs to specify which resources to use and how to handle fallback between them.

How compute configs work with multiple resources

  • The head node and all worker nodes for a cluster must deploy to the same cloud resource.
  • A cluster deploys to a single cloud resource at a time.
  • When resources aren't available in the primary cloud resource, Anyscale attempts to launch the cluster using the next cloud resource in the configuration.
  • You specify which cloud resource each configuration should use with the cloud_resource field in each config entry.

Simple example with auto-selected workers

The following example shows a minimal multi-resource compute config that uses auto-selected worker nodes. When you don't specify a head node, Anyscale uses the default for the cloud resource:

  • m5.2xlarge for AWS VMs
  • n2-standard-8 for Google Cloud VMs
  • Smallest CPU-only instance type for Kubernetes

cloud: my-multi-resource-cloud
configs:
  - cloud_resource: vm-gcp-us-west1
    auto_select_worker_config: true
  - cloud_resource: vm-aws-us-west-2
    auto_select_worker_config: true

For more information about auto-selected workers, see Auto-select worker nodes.

Multi-resource compute config examples

The following examples show compute configurations that correspond to the cloud resource configurations shown in the preceding section. Each example includes head node and worker group definitions for both CPU and GPU workloads.

note

By default, Anyscale prioritizes cloud resources using the availability strategy. You can also prioritize resources using the input_order strategy. See Control resource selection strategy.

Configure fallback between two AWS regions with CPU and GPU worker groups.

cloud: my-multi-resource-cloud
configs:
  - cloud_resource: vm-aws-us-west-2
    head_node:
      instance_type: m5.2xlarge
    worker_nodes:
      - instance_type: m5.4xlarge
        min_nodes: 0
        max_nodes: 10
        market_type: SPOT
      - instance_type: g5.4xlarge
        min_nodes: 0
        max_nodes: 5
        market_type: SPOT
  - cloud_resource: vm-aws-us-east-1
    head_node:
      instance_type: m5.2xlarge
    worker_nodes:
      - instance_type: m5.4xlarge
        min_nodes: 0
        max_nodes: 10
        market_type: SPOT
      - instance_type: g5.4xlarge
        min_nodes: 0
        max_nodes: 5
        market_type: SPOT

Control resource selection strategy

By default, Anyscale uses internal availability scoring to choose which cloud resource to use when launching a cluster. Anyscale bases this scoring on historical success rates for launching instances across different cloud resources and regions. The scoring considers only the resources requested, not the actual contents of your workloads.

You can override this behavior by setting the cloud_resource_strategy flag to input_order, which attempts to launch clusters using cloud resources in the order they're listed in your compute config.

The following example shows how to configure the resource selection strategy:

cloud: my-multi-resource-cloud
configs:
  - cloud_resource: vm-gcp-us-west1
    auto_select_worker_config: true
  - cloud_resource: vm-aws-us-west-2
    auto_select_worker_config: true
flags:
  min_resources:
    'accelerator_type:A100-40G': 1
  cloud_resource_strategy: input_order

Available strategies

  • availability (default): Anyscale chooses the cloud resource with the highest availability score from historical launch success rates.
  • input_order: Anyscale attempts to launch clusters using cloud resources in the order specified in the configs list.
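
The difference between the two strategies can be illustrated with a small Python model. This is a simplified sketch, not Anyscale's implementation: the function order_cloud_resources is hypothetical, and the scores dictionary is a made-up stand-in for Anyscale's internal availability scoring, which this page doesn't describe in detail.

```python
from typing import Dict, List, Optional


def order_cloud_resources(
    configs: List[dict],
    strategy: str = "availability",
    scores: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return cloud resource names in the order a launch would try them.

    Illustrative model only: ``scores`` stands in for the internal
    availability scoring, using invented numbers.
    """
    names = [entry["cloud_resource"] for entry in configs]
    if strategy == "input_order":
        # Try resources exactly as listed under ``configs``.
        return names
    if strategy == "availability":
        scores = scores or {}
        # Highest score first. sorted() is stable, so ties keep listed order.
        return sorted(names, key=lambda name: scores.get(name, 0.0), reverse=True)
    raise ValueError(f"unknown strategy: {strategy!r}")


configs = [
    {"cloud_resource": "vm-gcp-us-west1", "auto_select_worker_config": True},
    {"cloud_resource": "vm-aws-us-west-2", "auto_select_worker_config": True},
]
# Invented scores for illustration only.
made_up_scores = {"vm-gcp-us-west1": 0.6, "vm-aws-us-west-2": 0.9}

print(order_cloud_resources(configs, "input_order"))
# ['vm-gcp-us-west1', 'vm-aws-us-west-2']
print(order_cloud_resources(configs, "availability", made_up_scores))
# ['vm-aws-us-west-2', 'vm-gcp-us-west1']
```

With input_order, the list order in your compute config is the fallback order, so put your preferred cloud resource first.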

Set the resource starting timeout

The cloud_resource_starting_timeout flag controls how long Anyscale waits for the minimum required nodes to start on a cloud resource before falling back to another cloud resource. The default is 15 minutes.

Anyscale only applies this timeout while the cluster is in the PREPARING_MIN_INSTANCES state. When the timeout expires, Anyscale terminates the partially started instances on the current cloud resource, marks that resource as failed for the current launch attempt, and selects the next cloud resource using the active strategy.

Specify the timeout as a duration string, such as 30m for 30 minutes or 1h for one hour.
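
To make the duration format concrete, the following Python sketch converts such strings to seconds. It handles only the m and h suffixes shown on this page. Whether Anyscale accepts other suffixes or combined values isn't stated here, so the sketch rejects them. The function name parse_duration is hypothetical.

```python
import re

# Only the two suffixes shown on this page are handled.
_SECONDS_PER_UNIT = {"m": 60, "h": 3600}
_DURATION_PATTERN = re.compile(r"(\d+)([mh])")


def parse_duration(value: str) -> int:
    """Convert a duration string such as '30m' or '1h' to seconds."""
    match = _DURATION_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _SECONDS_PER_UNIT[unit]


print(parse_duration("30m"))  # 1800
print(parse_duration("1h"))   # 3600
```

In the same notation, the 15-minute default corresponds to 900 seconds.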

The following example sets a 30-minute timeout with the input_order strategy:

cloud: my-multi-resource-cloud
configs:
  - cloud_resource: vm-aws-us-west-2
    auto_select_worker_config: true
  - cloud_resource: vm-aws-us-east-1
    auto_select_worker_config: true
flags:
  cloud_resource_strategy: input_order
  cloud_resource_starting_timeout: 30m

Accelerator requirements

When using min_resources with accelerator types such as 'accelerator_type:A100-40G', the specified accelerator must be available in every target cloud resource. For VM deployments, each cloud resource in the compute config must have instance types with the specified GPU available. For Kubernetes deployments, you must define the accelerator types in your custom instance type configurations.

For more information about accelerator types and auto-selected workers, see Auto-select worker nodes.
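
One way to check this requirement before submitting a job is to compare the accelerators named in min_resources against what each cloud resource offers. The following Python sketch is an illustration under stated assumptions: the function resources_missing_accelerator is hypothetical, and the available inventory is invented sample data that you would fill in from your own provider regions or Kubernetes instance type configurations.

```python
from typing import Dict, List, Set

_PREFIX = "accelerator_type:"


def resources_missing_accelerator(
    min_resources: Dict[str, int],
    available: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each cloud resource to the required accelerators it lacks."""
    required = [
        key[len(_PREFIX):] for key in min_resources if key.startswith(_PREFIX)
    ]
    missing: Dict[str, List[str]] = {}
    for resource, accelerators in available.items():
        gaps = [name for name in required if name not in accelerators]
        if gaps:
            missing[resource] = gaps
    return missing


min_resources = {"accelerator_type:A100-40G": 1}
# Invented inventory for illustration. Replace with your own data.
available = {
    "vm-gcp-us-west1": {"A100-40G", "T4"},
    "vm-aws-us-west-2": {"T4"},
}

print(resources_missing_accelerator(min_resources, available))
# {'vm-aws-us-west-2': ['A100-40G']}
```

An empty result means every listed cloud resource can satisfy the accelerator request, at least according to the inventory you supplied.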

Limitations

The following limitations exist for multiple resource configurations:

  • Only available in Anyscale clouds created using cloud register.
  • Only jobs support configuration with multiple resources.
  • Ray clusters can only launch in one resource at a time for each job.
  • Services can only start on the primary cloud resource. Services aren't supported on other cloud resources within a multi-resource cloud.
  • No support for task or actor dashboards.
  • For Kubernetes cloud resources, the system storage bucket must reside in the same cloud region as the cluster and VPC. The Anyscale Operator's storage health check derives its object-storage endpoint from the resource's region field, so a bucket in a different region fails the health check.