Head node fault tolerance

important

Head node fault tolerance is only available on customer-hosted clouds. Reach out to preview-help@anyscale.com for info.

Ray Serve is resilient to process and worker node failures out of the box. However, Ray has a centralized component called the Global Control Store (GCS) that runs on the head node and stores its data in memory by default. If the head node crashes or runs out of memory, this data is lost, the cluster goes down with it, and the service can't serve traffic until the cluster is restarted.

To remedy this, enable head node fault tolerance for your service by configuring external storage for the Ray GCS. With head node fault tolerance enabled, the service will continue to serve traffic from Serve replicas running on the worker nodes during head node recovery.

Provisioning GCS storage

To enable head node fault tolerance, you must set up a Redis-compatible external storage cluster. There are two ways to do this:

  • Automatic setup (recommended) will provision external storage and configure all services in a cloud to use it by default.
  • Manual setup can be used for more fine-grained control.

You will pay an additional cost to the cloud provider for hosting the external storage cluster. Anyscale does not add an additional charge on top of this cost.

  • On AWS, the MemoryDB cluster size is 2GiB with an additional replica, which costs ~$70/month.
  • On GCP, the Memorystore cluster size is 5GiB with an additional replica, which costs ~$200/month.

note

The Ray GCS currently supports only single-shard Redis clusters, which can be replicated across multiple nodes. It does not support multi-shard clusters.

Automatic setup

To set a Redis-compatible external storage cluster as the default for all services in a cloud, use the cloud setup or cloud register CLI commands. Find detailed instructions on this process in Overview of Anyscale Cloud.

After provisioning the Redis-compatible cluster, the RayGCSExternalStorageConfig parameter of the ServiceConfig will be configured to use it by default for all services in the cloud. You can override the RayGCSExternalStorageConfig field for each service to customize the behavior.

Manual setup

You can also manually provision and configure a Redis-compatible external storage cluster. The supported storage solution varies by cloud provider: MemoryDB on AWS and Memorystore on GCP.

The external storage cluster must meet the following requirements. Reach out to Anyscale support with additional questions.

  • Must be in the same cloud region as your Anyscale cloud.
  • Must be created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
  • Must have at least 1 replica.
  • TLS must be disabled for Memorystore on GCP.
  • Must have at least 1GiB of storage.
    • A 10-node service will initially require ~20 MiB of storage. Over time the usage for a cluster can increase to as much as 100 MiB.
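The storage figures above can be turned into a rough capacity estimate. The sketch below is illustrative only: it assumes that usage adds up linearly across services, that each service peaks at the ~100 MiB upper figure quoted above, and that you want to stay under the 80% memory level used for alerting. The function name and parameters are hypothetical, not part of any Anyscale API.

```python
MIB_PER_GIB = 1024


def max_services_under_threshold(
    cluster_gib: float,
    per_service_mib: float = 100.0,  # assumed peak usage per service
    threshold: float = 0.8,          # fraction of memory to stay under
) -> int:
    """Estimate how many services fit before usage crosses the threshold.

    Assumes linear scaling: total usage = number of services * per_service_mib.
    """
    if cluster_gib <= 0 or per_service_mib <= 0 or not 0 < threshold <= 1:
        raise ValueError("sizes must be positive and threshold must be in (0, 1]")
    usable_mib = cluster_gib * MIB_PER_GIB * threshold
    return int(usable_mib // per_service_mib)


if __name__ == "__main__":
    for size_gib in (1, 2, 5):
        print(size_gib, "GiB ->", max_services_under_threshold(size_gib), "services")
```

Treat the result as a planning aid rather than a guarantee; real usage depends on the size and lifetime of each cluster.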

Once the Redis-compatible cluster is provisioned, the RayGCSExternalStorageConfig config must be manually specified in the ServiceConfig to enable head node fault tolerance.

Alerting

If the Redis-compatible external storage cluster reaches its maximum memory capacity, your services may experience significant disruptions. Therefore, we recommend configuring alerts using either AWS CloudWatch or GCP Alerting.

AWS CloudWatch alerts

  • Configure an alert on the DatabaseMemoryUsagePercentage metric
  • Configure the alert condition to trigger if the maximum value exceeds 80%

If the alarm is triggered, we recommend either terminating services to alleviate the memory load or scaling up the Redis-compatible cluster's memory capacity.
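As a minimal sketch of the alert described above, the function below builds the keyword arguments for the boto3 CloudWatch `put_metric_alarm` call. The metric name, the maximum statistic, and the 80% threshold come from this page. The `AWS/MemoryDB` namespace, the `ClusterName` dimension, the 5-minute period, and the alarm name are assumptions to verify against your own account before use. The function itself doesn't call AWS, so it runs without credentials.

```python
from typing import Optional


def build_memory_alarm_params(
    cluster_name: str,
    threshold_percent: float = 80.0,
    sns_topic_arn: Optional[str] = None,
) -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm for GCS storage memory."""
    params = {
        "AlarmName": f"{cluster_name}-gcs-memory-high",  # hypothetical naming scheme
        "Namespace": "AWS/MemoryDB",                      # assumption: MemoryDB metric namespace
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],  # assumption
        "Statistic": "Maximum",
        "Period": 300,                # assumption: evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        # Optional notification target, for example an SNS topic.
        params["AlarmActions"] = [sns_topic_arn]
    return params


# Usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**build_memory_alarm_params("my-gcs-cluster"))
```

Keeping the parameters in a plain dictionary makes it easy to review or version them before creating the alarm.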

Configuring external storage for a service

Use the RayGCSExternalStorageConfig field of the ServiceConfig to configure external storage usage for a service.

name: my-service
working_dir: .
applications:
- import_path: main:app
ray_gcs_external_storage_config:
  enabled: True
  address: redis-cluster-hostname:6379
  # Path to TLS certificates if enabled.
  certificate_path: "/etc/ssl/certs/ca-certificates.crt"

You can set the enabled field to False to disable head node fault tolerance for a service, even if the cloud uses automatic setup. If enabled is True and you are using manual setup, you must provide the address of the external storage. You can find the address in the console of the cloud provider you are using.

  • For AWS MemoryDB the address looks like: <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
  • For GCP Memorystore the address looks like: <IP>:<PORT>

If TLS is enabled (not currently supported on GCP), the address must be prefixed with rediss://. For example: rediss://<user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379. The certificate_path only needs to be updated when using private certificates.
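The address rules above can be sketched as two small helpers. This is an illustration under stated assumptions, not Anyscale's implementation: the function names are hypothetical, the host values in the usage lines are placeholders, and the returned dictionary simply mirrors the ray_gcs_external_storage_config keys shown in the YAML example.

```python
from typing import Optional


def format_gcs_address(host: str, port: int = 6379, tls: bool = False) -> str:
    """Return host:port, prefixed with rediss:// when TLS is enabled."""
    address = f"{host}:{port}"
    return f"rediss://{address}" if tls else address


def build_storage_config(
    enabled: bool,
    address: Optional[str] = None,
    manual_setup: bool = True,
    certificate_path: Optional[str] = None,
) -> dict:
    """Build the ray_gcs_external_storage_config mapping for a service."""
    # With manual setup there is no cloud-level default to fall back on.
    if enabled and manual_setup and not address:
        raise ValueError("address is required when enabled is True with manual setup")
    config = {"enabled": enabled}
    if address:
        config["address"] = address
    if certificate_path:
        # Only needed when using private certificates.
        config["certificate_path"] = certificate_path
    return config


if __name__ == "__main__":
    # Placeholder hosts for illustration only.
    print(build_storage_config(True, format_gcs_address("10.0.0.5")))
    print(build_storage_config(True, format_gcs_address("example-host", tls=True)))
```

Because TLS is not currently supported on GCP, the tls flag applies only to AWS MemoryDB addresses.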