Configure head node fault tolerance
This page provides an overview of configuring head node fault tolerance for Anyscale services. Anyscale recommends configuring head node fault tolerance for all Anyscale clouds that use services in production.
You can't configure head node fault tolerance for serverless Anyscale clouds (also called Anyscale-hosted clouds).
When you enable head node fault tolerance, Anyscale deploys additional resources in your cloud provider account that have ongoing costs. While Anyscale doesn't charge for these resources, they incur charges in your cloud provider account even if you aren't using Anyscale services.
- Anyscale clouds on AWS use MemoryDB with 2 GiB of memory and a replica. See Amazon MemoryDB pricing.
- Anyscale clouds on Google Cloud use Memorystore with 5 GiB of memory and a replica. See Memorystore for Redis Cluster pricing.
If you need help disabling head node fault tolerance, contact Anyscale support.
What is head node fault tolerance?
Head node fault tolerance uses a Redis-compatible external storage cluster to prevent service outages due to head node instability, out-of-memory issues, or machine failure.
With head node fault tolerance enabled, Anyscale services can continue to serve responses using replicas on worker nodes during head node recovery.
Enable head node fault tolerance for an Anyscale cloud
Anyscale recommends that you enable head node fault tolerance at the cloud level. When you enable head node fault tolerance for an Anyscale cloud, all services deployed in that cloud use the feature by default.
If you use anyscale cloud setup to deploy your Anyscale cloud, you can enable head node fault tolerance during cloud deployment by including the flag --enable-head-node-fault-tolerance. Anyscale automatically configures MemoryDB or Memorystore in your cloud provider account.
You can update an existing Anyscale cloud to enable fault tolerance by running the following command:
anyscale cloud update --name <cloud-name> --enable-head-node-fault-tolerance
Anyscale provides Terraform modules to help you configure and deploy custom Anyscale clouds on AWS, Google Cloud, and Kubernetes. You can use these modules to help configure Redis-compatible storage for use with anyscale cloud register.
See Introduction to Anyscale clouds.
You can turn off head node fault tolerance for a service in your Anyscale cloud by setting the ray_gcs_external_storage_config field in the service config, as in the following example:
name: my-service
applications:
- import_path: main:app
ray_gcs_external_storage_config:
  enabled: False
Disabling fault tolerance using the service config doesn't remove the Redis-compatible cluster from your Anyscale cloud deployment or deprovision resources in your cloud provider. If you need help removing this infrastructure, contact Anyscale support.
Configure alerting for your fault tolerance resources
If the Redis-compatible external storage cluster reaches its maximum memory capacity, your services may experience significant disruptions. Anyscale recommends configuring alerts using either AWS CloudWatch or GCP Alerting.
AWS CloudWatch Alerts
- Configure an alert on the DatabaseMemoryUsagePercentage metric.
- Configure the alert condition to trigger if the maximum value exceeds 80%.
GCP Alerting
- Configure an alert on the Cloud Memorystore Redis Instance - Memory Usage Ratio metric.
- Configure the alert condition to trigger if the maximum value exceeds 80%.
If the alarm is triggered, Anyscale recommends either terminating services to alleviate the memory load or scaling up the Redis-compatible cluster's memory capacity.
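The AWS alert described above can also be created programmatically. The following is a minimal sketch, not an Anyscale-provided script. It builds the arguments for the CloudWatch put_metric_alarm call in boto3. Only the metric name and the 80% threshold on the maximum value come from this page. The alarm name, the 5-minute period, the AWS/MemoryDB namespace, and the ClusterName dimension are assumptions that you should verify against the metrics your MemoryDB cluster actually reports.

```python
def build_memory_alarm(cluster_name: str, threshold_percent: float = 80.0) -> dict:
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    return {
        # Hypothetical naming scheme; pick whatever fits your conventions.
        "AlarmName": f"{cluster_name}-memory-usage",
        # Assumed namespace and dimension for MemoryDB metrics.
        "Namespace": "AWS/MemoryDB",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        # Metric and condition taken from the steps above.
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Statistic": "Maximum",
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumed evaluation window: one 5-minute period.
        "Period": 300,
        "EvaluationPeriods": 1,
    }


def create_memory_alarm(cluster_name: str) -> None:
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_memory_alarm(cluster_name))
```

Keeping the argument builder separate from the API call makes it possible to review or log the alarm definition before anything is created in your AWS account.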
Manually configure fault tolerance for an Anyscale service
Anyscale recommends configuring head node fault tolerance at the cloud level. See Enable head node fault tolerance for an Anyscale cloud. When you enable it at the cloud level, all services in your Anyscale cloud use the configured Redis-compatible cluster.
Manual configuration is an advanced pattern that's unnecessary for most users. You should only consider manual configuration if your Anyscale service requires a dedicated Redis-compatible cluster. Contact Anyscale support if you need help using this feature.
Use MemoryDB on AWS or Memorystore on Google Cloud.
If you configure a custom Redis cluster, you can't use a multi-shard cluster. Anyscale only supports single-shard Redis clusters and recommends replication across availability zones.
Requirements
Your Redis-compatible cluster must meet the following requirements:
- Created in the same cloud region as your Anyscale cloud.
- Created inside the Anyscale-managed VPC using the Anyscale-managed security group (anyscale-security-group).
- Have at least 1 replica.
- Have at least 1 GiB of storage.
  - A 10-node service initially requires around 20 MB of storage. Over time, the usage for a cluster can increase to 100 MB or more.
Anyscale doesn't support TLS for Google Cloud Memorystore.
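The requirements above can be expressed as a simple pre-flight check before you reference a cluster in a service config. This is a sketch under stated assumptions: the RedisClusterSpec dataclass and its field names are hypothetical and aren't part of the Anyscale SDK or any cloud provider API. You would fill it in from whatever your provider reports. The VPC and security group requirement isn't modeled here.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3  # bytes in one GiB


@dataclass
class RedisClusterSpec:
    """Hypothetical description of a Redis-compatible cluster."""
    region: str
    replicas: int
    shards: int
    storage_bytes: int


def check_requirements(spec: RedisClusterSpec, cloud_region: str) -> List[str]:
    """Return a list of unmet requirements (empty if all checks pass)."""
    problems = []
    if spec.region != cloud_region:
        problems.append("cluster must be in the same region as the Anyscale cloud")
    if spec.replicas < 1:
        problems.append("cluster must have at least 1 replica")
    if spec.shards != 1:
        problems.append("only single-shard clusters are supported")
    if spec.storage_bytes < GIB:
        problems.append("cluster must have at least 1 GiB of storage")
    return problems
```

For example, check_requirements(RedisClusterSpec("us-west-2", 1, 1, GIB), "us-west-2") returns an empty list, while a multi-shard cluster in another region returns one message per unmet requirement.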
Configure fault tolerance in your service config
Once you have provisioned a Redis-compatible cluster, add the RayGCSExternalStorageConfig config to the ServiceConfig to enable head node fault tolerance, as in the following examples:
YAML:
name: my-service
working_dir: .
applications:
- import_path: main:app
ray_gcs_external_storage_config:
  enabled: True
  address: redis-cluster-hostname:6379
  # Path to TLS certificates if enabled.
  certificate_path: "/etc/ssl/certs/ca-certificates.crt"

Python:
from anyscale.service.models import ServiceConfig, RayGCSExternalStorageConfig

config = ServiceConfig(
    name="my-service",
    working_dir=".",
    applications=[{"import_path": "main:app"}],
    ray_gcs_external_storage_config=RayGCSExternalStorageConfig(
        enabled=True,
        address="redis-cluster-hostname:6379",
        certificate_path="/etc/ssl/certs/ca-certificates.crt",
    ),
)
- Use the following address pattern for Google Cloud Memorystore: <ip-address>:<port>
- Use the following address pattern for AWS MemoryDB: <user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
- If you have TLS enabled, prefix the address with rediss://. For example: rediss://<user-provided-name>.<random-string>.clustercfg.memorydb.<region>.amazonaws.com:6379
- The certificate_path only needs to be updated when using private certificates.
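The address rules above can be captured in a small helper so that the rediss:// prefix is applied consistently. This is an illustrative sketch. The function name build_gcs_address and the hostname in the usage note are hypothetical, and the default port of 6379 follows the examples on this page.

```python
def build_gcs_address(host: str, port: int = 6379, tls: bool = False) -> str:
    """Format the address value for ray_gcs_external_storage_config.

    host is an IP address (Memorystore) or a cluster endpoint hostname
    (MemoryDB), without any scheme prefix.
    """
    if "://" in host:
        raise ValueError("pass the bare host; the scheme is added when tls=True")
    address = f"{host}:{port}"
    # TLS-enabled clusters use the rediss:// prefix.
    return f"rediss://{address}" if tls else address
```

For example, build_gcs_address("10.0.0.5") produces 10.0.0.5:6379, and build_gcs_address("my-cluster.abc123.clustercfg.memorydb.us-west-2.amazonaws.com", tls=True) produces the same hostname and port prefixed with rediss://.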