Configure head node fault tolerance

This page provides an overview of configuring head node fault tolerance for Anyscale services. Anyscale recommends enabling head node fault tolerance for all production services.

Anyscale recommends configuring head node fault tolerance at the cloud level so all services in the cloud share a single Redis-compatible cluster by default. See Enable head node fault tolerance for an Anyscale cloud.

To override the cloud-level configuration for a specific service or to enable fault tolerance without setting a cloud-level endpoint, configure ray_gcs_external_storage_config in the service config. See Manually configure fault tolerance for an Anyscale service.

important

Head node fault tolerance isn't supported for serverless Anyscale clouds (also called Anyscale-hosted clouds).

What is head node fault tolerance?

Head node fault tolerance uses a Redis-compatible external storage cluster to prevent service outages due to head node instability, out-of-memory issues, or machine failure.

With head node fault tolerance enabled, Anyscale services can continue to serve responses using replicas on worker nodes during head node recovery.

Enable head node fault tolerance for an Anyscale cloud

Anyscale recommends that you enable head node fault tolerance at the cloud level. When you enable head node fault tolerance for an Anyscale cloud, all services deployed in that cloud use the feature by default.

important

When you enable head node fault tolerance, Anyscale deploys additional resources in your cloud provider account that have ongoing costs. While Anyscale doesn't charge for these resources, they incur charges in your cloud provider account even if you aren't using Anyscale services.

If you need help disabling head node fault tolerance, contact Anyscale support.

The setup process depends on how you created your Anyscale cloud.

Clouds created with cloud setup

If you use anyscale cloud setup to deploy an Anyscale cloud on the VM stack, include the --enable-head-node-fault-tolerance flag during cloud deployment. Anyscale automatically configures MemoryDB or Memorystore in your cloud provider account.

To update an existing cloud, run the following command:

anyscale cloud update --name <cloud-name> --enable-head-node-fault-tolerance

Clouds created with cloud register

For clouds created with cloud register, you provision the external storage backend yourself and reference it in your cloud resource YAML.

To enable head node fault tolerance on a registered cloud, complete the following steps:

  1. Provision the required storage backend:

    • AWS VM stack: Create an Amazon MemoryDB cluster in the same VPC as your Anyscale cloud. The cluster must use TLS, a single shard, and at least one replica.
    • Google Cloud VM stack: Create a Memorystore instance in the same network as your Anyscale cloud.
    • Kubernetes: Provision a Redis-compatible cluster reachable from your Kubernetes data plane. See Requirements for cluster requirements.
  2. Export your cloud configuration:

    anyscale cloud get --name <cloud-name> --output cloud-resources.yaml
  3. Add the storage backend to the YAML:

    • AWS VM stack: Add memorydb_cluster_name under aws_config.
    • Google Cloud VM stack: Add memorystore_instance_name under gcp_config.
    • Kubernetes: Add redis_endpoint under kubernetes_config. See Kubernetes cloud resource example.
  4. Update the cloud:

    anyscale cloud update --name <cloud-name> --resources-file cloud-resources.yaml

Anyscale also provides Terraform modules to help configure Redis-compatible storage. See Introduction to Anyscale clouds.

Kubernetes cloud resource example

The following example shows the redis_endpoint field in a Kubernetes cloud resource YAML:

name: my-k8s-cloud-resource
provider: AWS
compute_stack: K8S
region: us-west-2
object_storage:
  bucket_name: s3://my-bucket
kubernetes_config:
  anyscale_operator_iam_identity: arn:aws:iam::123456789012:role/eks-node-role
  zones:
    - us-west-2a
    - us-west-2b
  redis_endpoint: redis.ray-system.svc.cluster.local:6379

The endpoint must be reachable from the data plane that runs your Anyscale workloads. Use the address pattern <host>:<port> for plaintext connections. For TLS, prefix the address with rediss://. For example: rediss://redis.ray-system.svc.cluster.local:6379.

If your Redis cluster uses TLS with a private certificate, configure certificate_path per service. See Configure fault tolerance in your service config.
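
As a rough illustration of the address patterns and the reachability requirement, the following Python sketch parses an endpoint string and attempts a plain TCP connection to it. The function names parse_endpoint and is_reachable are hypothetical helpers, not part of any Anyscale tooling. The probe only confirms that the port accepts connections from wherever you run it. It doesn't perform a TLS handshake or speak the Redis protocol.

```python
import socket
from urllib.parse import urlsplit


def parse_endpoint(endpoint: str) -> tuple[str, int, bool]:
    """Split '<host>:<port>' or 'rediss://<host>:<port>' into (host, port, uses_tls)."""
    uses_tls = endpoint.startswith("rediss://")
    # urlsplit needs a scheme to recognize the host and port,
    # so give plaintext endpoints a placeholder scheme.
    parts = urlsplit(endpoint if "://" in endpoint else f"redis://{endpoint}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected <host>:<port>, got {endpoint!r}")
    return parts.hostname, parts.port, uses_tls


def is_reachable(endpoint: str, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port, _ = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

To get a meaningful result, run a check like this from a pod or node inside the data plane rather than from your workstation.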

Turn off head node fault tolerance for a service

To turn off head node fault tolerance for a single service in your Anyscale cloud, set enabled to False under ray_gcs_external_storage_config in the service config, as in the following example:

name: my-service
applications:
- import_path: main:app
ray_gcs_external_storage_config:
  enabled: False

Disabling fault tolerance using the service config doesn't remove the Redis-compatible cluster from your Anyscale cloud deployment or deprovision resources in your cloud provider. If you need help removing this infrastructure, contact Anyscale support.

Manually configure fault tolerance for an Anyscale service

This section covers manually provisioning a Redis-compatible cluster and configuring fault tolerance at the service level. Use this approach to override the cloud-level configuration for a specific service or to enable fault tolerance without setting a cloud-level endpoint.

note

Anyscale recommends configuring head node fault tolerance at the cloud level. See Enable head node fault tolerance for an Anyscale cloud.

Cloud-level configuration shares a single Redis cluster across all services in your Anyscale cloud, which is sufficient for most use cases. Use per-service configuration only when you need a dedicated Redis cluster for a specific service.

Requirements

Your Redis-compatible cluster must meet the following requirements:

  • Accessible from your Kubernetes cluster's network.
  • Single shard configuration.
  • At least 1 replica for high availability.
  • At least 1 GiB of storage.
    • A 10-node service initially requires around 20 MB of storage. Over time, the usage for a cluster can increase to 100 MB or more.

note

You can't use a multi-shard Redis cluster. Anyscale only supports single shard Redis clusters and recommends replication across availability zones.

Anyscale doesn't support TLS for Google Cloud Memorystore.
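
The numeric requirements above can be expressed as a small pre-flight check. The following Python sketch is illustrative only: RedisClusterSpec and requirement_violations are hypothetical names, and you would fill in the values yourself from your cloud provider's console or API. It doesn't check network accessibility.

```python
from dataclasses import dataclass

MIN_REPLICAS = 1
MIN_STORAGE_GIB = 1.0


@dataclass
class RedisClusterSpec:
    """Hypothetical description of a cluster; not an Anyscale or cloud provider type."""
    shards: int
    replicas_per_shard: int
    storage_gib: float


def requirement_violations(spec: RedisClusterSpec) -> list[str]:
    """Return a message for each listed requirement the cluster fails to meet."""
    violations = []
    if spec.shards != 1:
        violations.append("cluster must use a single shard")
    if spec.replicas_per_shard < MIN_REPLICAS:
        violations.append("cluster needs at least 1 replica")
    if spec.storage_gib < MIN_STORAGE_GIB:
        violations.append("cluster needs at least 1 GiB of storage")
    return violations
```

An empty list means the cluster passes these three checks. For example, a spec with 2 shards and 0 replicas returns two messages.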

Configure fault tolerance in your service config

After you provision a Redis-compatible cluster, add ray_gcs_external_storage_config (the RayGCSExternalStorageConfig type) to your ServiceConfig to enable head node fault tolerance, as in the following example:

name: my-service
working_dir: .
applications:
- import_path: main:app
ray_gcs_external_storage_config:
  enabled: True
  address: redis-cluster-hostname:6379
  # Path to TLS certificates if enabled.
  certificate_path: "/etc/ssl/certs/ca-certificates.crt"

  • Use the following address pattern for Google Cloud Memorystore: <ip-address>:<port>
  • Use the following address pattern for AWS MemoryDB: <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
  • If you have TLS enabled, prefix the address with rediss://. For example: rediss://<user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379.
  • You only need to change certificate_path when your cluster uses private certificates.
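
To make the address rules concrete, the following Python sketch assembles the ray_gcs_external_storage_config mapping from its parts. The gcs_storage_config function is a hypothetical helper, and leaving certificate_path out when you don't supply one is an assumption based on the note about private certificates, not documented behavior.

```python
from typing import Optional


def gcs_storage_config(
    host: str,
    port: int = 6379,
    tls: bool = False,
    certificate_path: Optional[str] = None,
) -> dict:
    """Build the ray_gcs_external_storage_config mapping for a service config."""
    address = f"{host}:{port}"
    if tls:
        # TLS endpoints carry the rediss:// prefix.
        address = f"rediss://{address}"
    config = {"enabled": True, "address": address}
    if certificate_path is not None:
        # Assumption: only needed for private certificates.
        config["certificate_path"] = certificate_path
    return config
```

For a Memorystore instance, pass the IP address as host and leave tls set to False. For a TLS-enabled MemoryDB cluster, pass the clustercfg hostname and set tls to True.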

Configure alerting for your fault tolerance resources

If the Redis-compatible external storage cluster reaches its maximum memory capacity, your services may experience significant disruptions. Anyscale recommends configuring alerts using either AWS CloudWatch or Google Cloud Alerting.

AWS CloudWatch Alerts

  • Configure an alert on the DatabaseMemoryUsagePercentage metric.
  • Configure the alert condition to trigger if the maximum value exceeds 80%.

If the alarm triggers, Anyscale recommends either terminating services to alleviate the memory load or scaling up the Redis-compatible cluster's memory capacity.
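
As one possible way to script the CloudWatch alert described above, the following Python sketch builds the parameters for a CloudWatch alarm and passes them to the boto3 put_metric_alarm call. It assumes an Amazon MemoryDB cluster, the AWS/MemoryDB namespace, and a ClusterName dimension. Confirm which dimensions your cluster actually publishes before relying on it. The alarm name, the 5-minute period, and the optional SNS topic are illustrative choices, and boto3 is a third-party dependency.

```python
from typing import Optional


def memory_alarm_params(
    cluster_name: str,
    threshold_percent: float = 80.0,
    sns_topic_arn: Optional[str] = None,
) -> dict:
    """Build put_metric_alarm arguments for a memory usage alarm."""
    params = {
        "AlarmName": f"{cluster_name}-memory-usage",  # illustrative name
        "Namespace": "AWS/MemoryDB",
        "MetricName": "DatabaseMemoryUsagePercentage",
        # Assumption: the metric is published with a ClusterName dimension.
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; illustrative
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn is not None:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_memory_alarm(cluster_name: str, region: str, **kwargs) -> None:
    """Create or update the alarm in the given region."""
    import boto3  # imported lazily so the helper above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**memory_alarm_params(cluster_name, **kwargs))
```

Keeping the parameters in a separate function lets you review or log them before creating the alarm in your account.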