Head node fault tolerance
Head node fault tolerance is only available on customer-hosted clouds. Contact Anyscale support for more information.
Ray Serve is resilient to process and worker node failures out of the box. However, Ray has a centralized component called the Global Control Store (GCS) that runs on the head node and stores its data in memory by default. If the head node crashes or runs out of memory, this data is lost, the cluster goes down with it, and the service cannot serve traffic until the cluster is restarted.
To remedy this, enable head node fault tolerance for your service by configuring external storage for the Ray GCS. With head node fault tolerance enabled, the service will continue to serve traffic from Serve replicas running on the worker nodes during head node recovery.
Provisioning GCS storage
In order to enable head node fault tolerance, you must set up a Redis-compatible external storage cluster. There are two ways to accomplish this:
- Automatic setup (recommended) will provision external storage and configure all services in a cloud to use it by default.
- Manual setup can be used for more fine-grained control.
You will pay an additional cost to the cloud provider for hosting the external storage cluster. Anyscale does not add an additional charge on top of this cost.
- On AWS, the MemoryDB cluster size is 2GiB with an additional replica, which costs ~$70/month.
- On GCP, the Memorystore cluster size is 5GiB with an additional replica, which costs ~$200/month.
The Ray GCS supports only single-shard Redis clusters (which can be replicated across multiple nodes). It does NOT support multi-shard clusters.
Automatic setup
To set a Redis-compatible external storage cluster as the default for all services in a cloud, use the cloud setup or cloud register CLI commands.
Find detailed instructions on this process in Overview of Anyscale Cloud.
After provisioning the Redis-compatible cluster, the RayGCSExternalStorageConfig parameter of the ServiceConfig will be configured to use it by default for all services in the cloud. You can override the RayGCSExternalStorageConfig field for each service to customize the behavior.
Manual setup
You can also manually provision and configure a Redis-compatible external storage cluster. The supported storage solution varies by cloud provider:
- AWS: MemoryDB.
- GCP: Memorystore.
External storage requirements. Reach out to Anyscale support with additional questions:
- Must be in the same cloud region as your Anyscale cloud.
- Must be created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
- Must have at least 1 replica.
- TLS must be disabled for Memorystore on GCP.
- Must have at least 1 GiB of storage.
  - A 10-node service initially requires ~20 MiB of storage. Over time, the usage for a cluster can increase to as much as 100 MiB.
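As a rough illustration of the sizing figures above, the following sketch estimates how many services can share one storage cluster. It is not an official sizing formula. It assumes every service resembles the 10-node example and peaks at about 100 MiB, and it reserves headroom so that usage stays under the 80% alert threshold recommended in the Alerting section. The function name and its parameters are hypothetical.

```python
MIB_PER_GIB = 1024


def max_services(capacity_gib: float,
                 peak_mib_per_service: float = 100.0,
                 alert_ratio: float = 0.8) -> int:
    """Estimate how many services fit before usage crosses the alert ratio.

    Assumption: each service peaks at roughly `peak_mib_per_service`,
    the upper bound quoted for a 10-node service.
    """
    usable_mib = capacity_gib * MIB_PER_GIB * alert_ratio
    return int(usable_mib // peak_mib_per_service)


# 1 GiB is the minimum size listed in the requirements.
print(max_services(1))
# 2 GiB is the size of the automatically provisioned MemoryDB cluster.
print(max_services(2))
```

Larger services or services with many deployments may use more memory than this estimate, so treat the result as a starting point and rely on the alerts for the actual signal.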
Once the Redis-compatible cluster is provisioned, the RayGCSExternalStorageConfig config must be manually specified in the ServiceConfig to enable head node fault tolerance.
Alerting
If the Redis-compatible external storage cluster reaches its maximum memory capacity, your services may experience significant disruptions. Therefore, we recommend configuring alerts using either AWS CloudWatch or GCP Alerting.
AWS CloudWatch Alerts
- Configure an alert on the DatabaseMemoryUsagePercentage metric.
- Configure the alert condition to trigger if the maximum value exceeds 80%.
GCP Alerting
- Configure an alert on the Cloud Memorystore Redis Instance - Memory Usage Ratio metric.
- Configure the alert condition to trigger if the maximum value exceeds 80%.
If the alarm is triggered, we recommend either terminating services to alleviate the memory load or scaling up the Redis-compatible cluster's memory capacity.
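For AWS, the alert described above could be created programmatically. The following is a minimal sketch, not a definitive setup: it builds the arguments for the boto3 CloudWatch `put_metric_alarm` call. The `AWS/MemoryDB` namespace, the `ClusterName` dimension, the 5-minute period, and the alarm name are assumptions to verify against your own MemoryDB metrics. Both function names are hypothetical.

```python
def build_memory_alarm(cluster_name: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{cluster_name}-gcs-memory-usage",  # hypothetical name
        "Namespace": "AWS/MemoryDB",                      # assumed namespace
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],  # assumed dimension
        "Statistic": "Maximum",   # trigger on the maximum value
        "Period": 300,            # assumed 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_memory_alarm(cluster_name: str) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_memory_alarm(cluster_name))
```

Calling `create_memory_alarm("<your-cluster-name>")` would register the alarm. Add `AlarmActions` to the arguments if the alarm should notify an SNS topic.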
Configuring external storage for a service
Use the RayGCSExternalStorageConfig field of the ServiceConfig to configure external storage usage for a service.
YAML:

    name: my-service
    working_dir: .
    applications:
    - import_path: main:app
    ray_gcs_external_storage_config:
      enabled: True
      address: redis-cluster-hostname:6379
      # Path to TLS certificates if enabled.
      certificate_path: "/etc/ssl/certs/ca-certificates.crt"

Python:

    from anyscale.service.models import ServiceConfig, RayGCSExternalStorageConfig

    config = ServiceConfig(
        name="my-service",
        working_dir=".",
        applications=[{"import_path": "main:app"}],
        ray_gcs_external_storage_config=RayGCSExternalStorageConfig(
            enabled=True,
            address="redis-cluster-hostname:6379",
            certificate_path="/etc/ssl/certs/ca-certificates.crt",
        ),
    )
You can use the enabled field to disable head node fault tolerance even if you are using Automatic setup.
If enabled is True and you are using Manual setup, you must provide the address to connect to external storage.
You can find the address in the console of the cloud provider you are using.
- For AWS MemoryDB, the address looks like: <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
- For GCP Memorystore, the address looks like: <IP>:<PORT>
If TLS is enabled (not supported on GCP), the address must be prefixed with rediss://
For example: rediss://<user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
The certificate_path only needs to be updated when using private certificates.
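The address rules above can be sketched as two small helpers. This is an illustrative sketch, not part of the Anyscale SDK: `gcs_address` and `check_storage_settings` are hypothetical names, and the `manual_setup` flag is an assumption used here to model the Manual setup case.

```python
from typing import Optional


def gcs_address(host: str, port: int = 6379, tls: bool = False) -> str:
    """Format the external storage address, adding rediss:// when TLS is on."""
    scheme = "rediss://" if tls else ""
    return f"{scheme}{host}:{port}"


def check_storage_settings(enabled: bool,
                           address: Optional[str],
                           manual_setup: bool) -> None:
    """Raise if Manual setup is enabled without an address."""
    if enabled and manual_setup and not address:
        raise ValueError("address is required when enabled is True with Manual setup")


# GCP Memorystore style address (TLS is not supported on GCP).
print(gcs_address("10.0.0.5"))
# TLS-enabled address with a placeholder host name.
print(gcs_address("redis-cluster-hostname", tls=True))
```

The resulting string is what you would pass as the address field of the RayGCSExternalStorageConfig.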