Configure head node fault tolerance

This page provides an overview of configuring head node fault tolerance for Anyscale services. Anyscale recommends configuring head node fault tolerance for all Anyscale clouds that use services in production.

You can't configure head node fault tolerance for serverless Anyscale clouds (also called Anyscale-hosted clouds).

important

When you enable head node fault tolerance, Anyscale deploys additional resources in your cloud provider account that have ongoing costs. While Anyscale doesn't charge for these resources, they incur charges in your cloud provider account even if you aren't using Anyscale services.

Anyscale clouds on AWS use MemoryDB with 2 GiB of memory and a replica. See Amazon MemoryDB pricing.
Anyscale clouds on Google Cloud use Memorystore with 5 GiB of memory and a replica. See Memorystore for Redis Cluster pricing.

If you need help disabling head node fault tolerance, contact Anyscale support.

What is head node fault tolerance?

Head node fault tolerance uses a Redis-compatible external storage cluster to prevent service outages due to head node instability, out-of-memory issues, or machine failure.

With head node fault tolerance enabled, Anyscale services can continue to serve responses using replicas on worker nodes during head node recovery.

Enable head node fault tolerance for an Anyscale cloud

Anyscale recommends that you enable head node fault tolerance at the cloud level. When you enable head node fault tolerance for an Anyscale cloud, all services deployed in that cloud use the feature by default.

If you use anyscale cloud setup to deploy your Anyscale cloud, you can enable head node fault tolerance during cloud deployment by including the flag --enable-head-node-fault-tolerance. Anyscale automatically configures MemoryDB or Memorystore in your cloud provider account.

You can update an existing Anyscale cloud to enable fault tolerance by running the following command:

anyscale cloud update --name <cloud-name> --enable-head-node-fault-tolerance

Anyscale provides Terraform modules to help you configure and deploy custom Anyscale clouds on AWS, Google Cloud, and Kubernetes. You can use these modules to help configure Redis-compatible storage for use with anyscale cloud register.

See Introduction to Anyscale clouds.

important

You can turn off head node fault tolerance for a service in your Anyscale cloud by setting ray_gcs_external_storage_config in the service config, as in the following example:

name: my-service
applications:
  - import_path: main:app
ray_gcs_external_storage_config:
  enabled: False

Disabling fault tolerance using the service config doesn't remove the Redis-compatible cluster from your Anyscale cloud deployment or deprovision resources in your cloud provider. If you need help removing this infrastructure, contact Anyscale support.

Configure alerting for your fault tolerance resources

If the Redis-compatible external storage cluster reaches its maximum memory capacity, your services might experience significant disruptions. Anyscale recommends configuring alerts on memory usage using either AWS CloudWatch or GCP Alerting.

AWS CloudWatch Alerts

Configure an alert on the DatabaseMemoryUsagePercentage metric
Configure the alert condition to trigger if the maximum value exceeds 80%

GCP Alerting

Configure an alert on the Cloud Memorystore Redis Instance - Memory Usage Ratio metric
Configure the alert condition to trigger if the maximum value exceeds 80%

If the alert triggers, Anyscale recommends either terminating services to reduce the memory load or scaling up the memory capacity of the Redis-compatible cluster.
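As one way to automate the CloudWatch alert described above, the following sketch builds the parameters for a CloudWatch alarm on the DatabaseMemoryUsagePercentage metric. It assumes the boto3 library, the AWS/MemoryDB metric namespace, and a ClusterName dimension. The alarm name, the 5-minute period, and the cluster name are hypothetical placeholders, so verify each value against your own deployment before using it.

```python
def build_memory_alarm_params(cluster_name: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{cluster_name}-memory-usage-high",  # hypothetical naming scheme
        "Namespace": "AWS/MemoryDB",  # assumption: namespace for MemoryDB metrics
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],  # assumed dimension
        "Statistic": "Maximum",  # alert on the maximum value, as recommended above
        "Period": 300,  # hypothetical 5-minute evaluation window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_memory_alarm(cluster_name: str) -> None:
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the parameter builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_memory_alarm_params(cluster_name))


# "my-memorydb-cluster" is a placeholder cluster name.
params = build_memory_alarm_params("my-memorydb-cluster")
```

Keeping the parameter builder separate from the API call makes it possible to review or log the alarm definition before creating anything in your AWS account.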

Manually configure fault tolerance for an Anyscale service

Anyscale recommends configuring head node fault tolerance at the cloud level. See Enable head node fault tolerance for an Anyscale cloud. When you enable the feature at the cloud level, all services in your Anyscale cloud use the configured Redis-compatible cluster.

Manual configuration is an advanced pattern that's unnecessary for most users. You should only consider manual configuration if your Anyscale service requires a dedicated Redis-compatible cluster. Contact Anyscale support if you need help using this feature.

Use MemoryDB on AWS or Memorystore on Google Cloud.

note

If you configure a custom Redis cluster, you can't use a multi-shard cluster. Anyscale only supports single-shard Redis clusters and recommends replication across availability zones.

Requirements

Your Redis-compatible cluster must meet the following requirements:

Is created in the same cloud region as your Anyscale cloud.
Is created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
Has at least 1 replica.
Has at least 1 GiB of storage.
- A 10-node service initially requires around 20 MB of storage. Over time, the usage for a cluster can increase to 100 MB or more.
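The requirements above can be expressed as a simple pre-flight check. The following is a minimal sketch, not an Anyscale API: RedisClusterSpec and find_requirement_violations are hypothetical names, and the check covers only the region, replica, storage, and single-shard requirements. It doesn't verify the VPC or security group.

```python
from dataclasses import dataclass


@dataclass
class RedisClusterSpec:
    """Hypothetical description of a Redis-compatible cluster."""
    region: str
    replicas: int
    storage_gib: float
    shards: int


def find_requirement_violations(spec: RedisClusterSpec, cloud_region: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no violations found."""
    problems = []
    if spec.region != cloud_region:
        problems.append(f"cluster region {spec.region} differs from cloud region {cloud_region}")
    if spec.replicas < 1:
        problems.append("cluster needs at least 1 replica")
    if spec.storage_gib < 1:
        problems.append("cluster needs at least 1 GiB of storage")
    if spec.shards != 1:
        problems.append("only single-shard clusters are supported")
    return problems


# Placeholder values for illustration only.
ok = find_requirement_violations(RedisClusterSpec("us-west-2", 1, 1, 1), "us-west-2")
bad = find_requirement_violations(RedisClusterSpec("us-east-1", 0, 2, 1), "us-west-2")
```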

note

Anyscale doesn't support TLS for Google Cloud Memorystore.

Configure fault tolerance in your service config

Once you have provisioned a Redis-compatible cluster, add the RayGCSExternalStorageConfig config to the ServiceConfig to enable head node fault tolerance, as in the following examples:

YAML

name: my-service
working_dir: .
applications:
  - import_path: main:app
ray_gcs_external_storage_config:
  enabled: True
  address: redis-cluster-hostname:6379
  # Path to TLS certificates if enabled.
  certificate_path: "/etc/ssl/certs/ca-certificates.crt"

Python

from anyscale.service.models import ServiceConfig, RayGCSExternalStorageConfig

config = ServiceConfig(
  name="my-service",
  working_dir=".",
  applications=[{"import_path": "main:app"}],
  ray_gcs_external_storage_config=RayGCSExternalStorageConfig(
    enabled=True,
    address="redis-cluster-hostname:6379",
    certificate_path="/etc/ssl/certs/ca-certificates.crt",
  ),
)

Use the following address pattern for Google Cloud Memorystore: <ip-address>:<port>
Use the following address pattern for AWS MemoryDB: <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
If you have TLS enabled, prefix the address with rediss://. For example: rediss://<user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379.
You only need to set certificate_path when you use private certificates.
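The address rules above can be captured in a small helper. This is a minimal sketch: build_gcs_storage_address is a hypothetical function, not part of the Anyscale SDK, and the host names and IP address are placeholders.

```python
def build_gcs_storage_address(host: str, port: int = 6379, tls: bool = False) -> str:
    """Format an address for ray_gcs_external_storage_config.

    Adds the rediss:// prefix only when TLS is enabled.
    """
    if not host:
        raise ValueError("host must not be empty")
    address = f"{host}:{port}"
    return f"rediss://{address}" if tls else address


# Google Cloud Memorystore pattern: <ip-address>:<port> (placeholder IP, no TLS).
memorystore_address = build_gcs_storage_address("10.0.0.5", 6379)

# AWS MemoryDB pattern with TLS enabled (placeholder cluster name, suffix, and region).
memorydb_tls_address = build_gcs_storage_address(
    "my-cluster.abc123.clustercfg.memorydb.us-west-2.amazonaws.com", tls=True
)
```

You can pass the resulting string as the address field in either the YAML or the Python service config.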
