Skip to main content

Head node fault tolerance

note

Head node fault tolerance with Anyscale Services requires Ray 2.3+.

Ray and Ray Serve are resilient to process and worker node failures out of the box. However, Ray has a centralized component called the Global Control Store (GCS) running on the head node. By default, if this component goes down (as in the head node crashes or runs out of memory), the cluster will go down with it. For a service, this means it won't be able to serve traffic until Anyscale restarts the cluster and redeploys the application.

To remedy this, enable head node fault tolerance for your service by configuring external storage for the Ray GCS. With head node fault tolerance enabled, the service will continue to serve traffic from Serve replicas running on the worker nodes during head node downtime.

Provisioning GCS storage

In order to enable head node fault tolerance, you must set up a Redis-compatible external storage cluster. The Redis can be automatically configured and setup for all your services within the cloud using Automatic Setup. You can also setup your own Redis and configure your services manually using Manual Setup. For cost efficiency and a streamlined approach to enable head node fault tolerance we recommend using the Automatic Setup approach. However, for critical workloads where you need service isolation, we recommend using the Manual Setup approach to provision a standalone Redis cluster for each Anyscale Service.

note

You will pay an additional cost to the cloud provider for hosting the external storage cluster. On AWS, the MemoryDB cluster size is 2GiB with an additional replica, which costs ~$70/month. On GCP, the Memorystore cluster size is 5GiB with an additional replica, which costs ~$200/month. Anyscale does not add an additional charge on top of this cost. TLS support is currently only enabled for MemoryDB on AWS. Anyscale will support TLS for Memorystore on GCP in the near future. Please note that GCS stores only Ray metadata and contains no customer information. Additionally, please note that Ray GCS currently only supports a single shard Redis clusters (which might be replicated across multiple nodes), but does NOT support multi-shard clusters.

Automatic Setup

To set a Redis-compatible external storage cluster as the default for all Anyscale Services, utilize the cloud setup or cloud register CLI commands. Detailed instructions on this process can be found in our cloud deployment documentation.

Upon provisioning of the Redis-compatible cluster, the ray_gcs_external_storage_config parameter of the Service YAML is automatically set for all services within the Anyscale Cloud.

If you need to customize the Redis-compatible configuration, manually specifying the ray_gcs_external_storage_config in the Service YAML will override these settings.

Manual Setup

A Redis-compatible external storage cluster can also be manually setup and configured for Anyscale Services. The supported storage solution varies by cloud provider:

External storage requirements (please reach out to Anyscale support for additional help):

  • Must be in the same cloud region as your Anyscale cloud.
  • Must be created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
  • Must have at least 1 replica
  • TLS must be disabled for Memorystore on GCP
  • We recommend setting up the Redis-compatible cluster with at least 1GiB of storage
    • A 10-node service will initially require approximately 20 MiB of storage. However, over time, the usage for such a cluster can increase to as much as 100 MiB due to autoscaling events

Once the Redis-compatible cluster is provisioned, the ray_gcs_external_storage_config config must be manually configured in the Service YAML to enable head node fault tolerance.

Alerting

In the event that the Redis-compatible external storage cluster reaches its maximum memory capacity, your services may experience significant disruptions. Therefore, we recommend configuring alerts using either AWS CloudWatch or GCP Alerting.

AWS Cloudwatch Alerts

  • Configure an alert on the DatabaseMemoryUsagePercentage metric
  • Configure the alert condition to trigger if the maximum value exceeds 80%

If the alarm is triggered, we recommend either terminating services to alleviate the memory load or scaling up the Redis-compatible cluster's memory capacity.

Configuring external GCS storage

Refer to the example below or Configure Anyscale Service to see how to configure the ray_gcs_external_storage_config config:

my_production_service.yaml
name: my-first-service
ray_serve_config:
applications:
- name: default
import_path: serve_hello:entrypoint
runtime_env:
working_dir: https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip
ray_gcs_external_storage_config:
address: <REDIS DB HOSTNAME:PORT>
enable: True
redis_certificate_path: "/etc/ssl/certs/ca-certificates.crt" # Path to Certificate if TLS is enabled for Redis

The address is a mandatory field if enable is True; otherwise, it is optional.

  • For MemoryDB, AWS provides as a DNS name that looks like:
    <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
  • For Memorystore, GCP provides a IP address and port that looks like:
    <IP>:<PORT>
important

If TLS is enabled, the address needs to be prefixed with rediss:// For example: rediss://user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379 or rediss://<IP>:<PORT>

You can set the enable field to False in the configuration to deactivate GCS FT for the service. The redis_certificate_path field is an optional path that specifies the path for the CA certificate. This path is defaulted by Anyscale to the public certificate at /etc/ssl/certs/ca-certificates.crt and does not need to be updated unless using private certificates.

Recovery procedure

In the event of a head node failure, the service maintains functionality by redirecting traffic to the worker nodes. Important considerations:

  • Temporary Request Failures: There may be a brief period during which requests fail. This occurs while the load balancer updates it's list of active and healthy backends, usually resolved within 30 seconds. Implementing client-side retries is a recommended best practice to counteract this.
  • Limitations During Downtime: While the head node is down, capabilities such as updates, autoscaling, and failure recovery (for instance, from a worker node crash) are suspended.
  • Redundancy Requirements: Ensure your application has sufficient redundancy to handle requests solely from worker nodes, avoiding single points of failure.

Recovery process upon a head node failure

  1. Provisioning of New Head Node: The Anyscale Control Plane starts provisioning a new head node immediately.
  2. Time to Come Online: The time for the new head node to become operational is dependent on the instance type (and availability), the size of your Cluster Environment image, and your application configuration. Typically, this duration aligns with the startup time of the service worker nodes.
  3. Restoration of Full Services: Once the new head node is ready, it connects with the existing worker nodes. This re-establishes the full functionality of the Service, returning to normal operation.

The entire recovery process generally completes within a few minutes.