Head node fault tolerance
Head node fault tolerance is only available on customer-hosted clouds. Reach out to Anyscale support for more information.
Ray Serve is resilient to process and worker node failures out of the box. However, Ray has a centralized component called the Global Control Store (GCS) that runs on the head node and stores its data in memory by default. If the head node crashes or runs out of memory, this data is lost, the cluster goes down with it, and the service can't serve traffic until the cluster is restarted.
To remedy this, enable head node fault tolerance for your service by configuring external storage for the Ray GCS. With head node fault tolerance enabled, the service will continue to serve traffic from Serve replicas running on the worker nodes during head node recovery.
Provisioning GCS storage
To enable head node fault tolerance, you must set up a Redis-compatible external storage cluster. There are two ways to do this:
- Automatic setup (recommended) will provision external storage and configure all services in a cloud to use it by default.
- Manual setup can be used for more fine-grained control.
You will pay an additional cost to the cloud provider for hosting the external storage cluster. Anyscale does not add an additional charge on top of this cost.
- On AWS, the MemoryDB cluster size is 2GiB with an additional replica, which costs ~$70/month.
- On GCP, the Memorystore cluster size is 5GiB with an additional replica, which costs ~$200/month.
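For budgeting, the monthly figures above can be turned into a rough yearly estimate. This is a minimal sketch: the dictionary keys and the function name are illustrative, the prices are the approximate values quoted above, and it assumes one storage cluster per cloud.

```python
# Approximate monthly costs quoted above, in USD. Provider pricing may change.
MONTHLY_COST_USD = {"aws_memorydb": 70, "gcp_memorystore": 200}


def annual_cost_usd(provider: str, clouds: int = 1) -> int:
    """Estimate yearly external storage cost, assuming one cluster per cloud."""
    return MONTHLY_COST_USD[provider] * 12 * clouds


print(annual_cost_usd("aws_memorydb"))     # 840
print(annual_cost_usd("gcp_memorystore"))  # 2400
```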
The Ray GCS supports only single-shard Redis clusters. The single shard can be replicated across multiple nodes, but multi-shard clusters are NOT supported.
Automatic setup
To set a Redis-compatible external storage cluster as the default for all services in a cloud, use the cloud setup or cloud register CLI commands.
Find detailed instructions on this process in Overview of Anyscale Cloud.
After provisioning the Redis-compatible cluster, the RayGCSExternalStorageConfig parameter of the ServiceConfig will be configured to use it by default for all services in the cloud. You can override the RayGCSExternalStorageConfig field for each service to customize the behavior.
Manual setup
You can also manually provision and configure a Redis-compatible external storage cluster. The supported storage solution varies by cloud provider:
- AWS: MemoryDB.
- GCP: Memorystore.
The external storage cluster must meet the following requirements. Reach out to Anyscale support with additional questions.
- Must be in the same cloud region as your Anyscale cloud.
- Must be created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
- Must have at least 1 replica.
- TLS must be disabled for Memorystore on GCP.
- Must have at least 1 GiB of storage. A 10-node service initially requires ~20 MiB of storage, and usage for a cluster can grow over time to as much as 100 MiB.
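The requirements above, together with the single-shard constraint, can be expressed as a small pre-flight check. This is only a sketch: ExternalStorage and check_requirements are hypothetical names, not part of any Anyscale or cloud provider API, and you would need to fill in the values yourself from your provider's console or CLI.

```python
from dataclasses import dataclass

MIB_PER_GIB = 1024


@dataclass
class ExternalStorage:
    """Hypothetical description of a manually provisioned Redis-compatible cluster."""
    provider: str          # "aws" (MemoryDB) or "gcp" (Memorystore)
    region: str
    in_anyscale_vpc: bool  # created in the Anyscale-managed VPC and security group
    replicas: int
    shards: int
    tls_enabled: bool
    storage_mib: int


def check_requirements(storage: ExternalStorage, cloud_region: str) -> list[str]:
    """Return a list of violated requirements; an empty list means all checks pass."""
    problems = []
    if storage.region != cloud_region:
        problems.append("must be in the same region as the Anyscale cloud")
    if not storage.in_anyscale_vpc:
        problems.append("must be in the Anyscale-managed VPC and security group")
    if storage.replicas < 1:
        problems.append("must have at least 1 replica")
    if storage.shards != 1:
        problems.append("must be a single-shard cluster")
    if storage.provider == "gcp" and storage.tls_enabled:
        problems.append("TLS must be disabled for Memorystore on GCP")
    if storage.storage_mib < MIB_PER_GIB:
        problems.append("must have at least 1 GiB of storage")
    return problems


candidate = ExternalStorage("gcp", "us-central1", True, 1, 1, False, 5 * MIB_PER_GIB)
print(check_requirements(candidate, "us-central1"))  # []
```

An empty list means the cluster satisfies every requirement listed here. It does not verify network reachability or any other property of the actual deployment.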
Once the Redis-compatible cluster is provisioned, you must manually specify the RayGCSExternalStorageConfig in the ServiceConfig to enable head node fault tolerance.
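As a rough illustration of the manual step, the sketch below adds an external storage section to a service config represented as a plain Python dict. The key names ray_gcs_external_storage_config and address, the helper name, and the endpoint value are all assumptions made for illustration. Check the Anyscale ServiceConfig reference for the exact field names before relying on them.

```python
def with_external_gcs_storage(service_config: dict, redis_address: str) -> dict:
    """Return a copy of service_config with the external GCS storage section set.

    The key names below are assumed, not confirmed against the Anyscale API.
    """
    updated = dict(service_config)  # shallow copy so the input is left unchanged
    updated["ray_gcs_external_storage_config"] = {"address": redis_address}
    return updated


base = {"name": "my-service"}  # hypothetical minimal service config
# Placeholder endpoint; substitute the address of your provisioned cluster.
config = with_external_gcs_storage(base, "my-redis.example.internal:6379")
print(config["ray_gcs_external_storage_config"]["address"])
```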