Production best practices

This page outlines best practices to maximize stability for services running production workloads.

  1. Retries and timeouts
  2. Avoid scheduling actors and tasks on the head node
  3. Avoid single points of failure

Retries and timeouts

Client-side retries and timeouts

Many things can go wrong when processing a request to your service, from the application code to the Ray cluster to the load balancer. To minimize user-facing disruption when problems occur, be defensive and always use end-to-end retries from the client.

In addition to retries, it's also important to configure timeouts in your client code for two reasons:

  1. Timeouts avoid hard-to-debug hanging behavior in the calling code (for example, in your backend server).
  2. Ray Serve does not drop requests by default when it's overloaded. When clients time out requests and disconnect, the load on the service drops, which helps it keep up with inbound traffic. For the same reason, client retries should use exponential back-off to reduce load when the service cannot respond in time.

Exactly how to implement retries and timeouts is specific to your client code, but below is an example using the Python requests library:

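import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Exponential back-off
    # Retry on server errors (the exact codes listed here are an assumption;
    # adjust them to match your service).
    status_forcelist=[500, 502, 503, 504],
)

session.mount("http://", HTTPAdapter(max_retries=retries))

# Always pass a timeout (in seconds) so the call cannot hang indefinitely.
response = session.get("http://localhost:8000/", timeout=10)
result = response.text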
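
For clients that don't use requests, the same retry pattern can be written by hand. The following is a minimal sketch using only the Python standard library. The helper name `call_with_retries`, its parameters, and the jitter strategy are illustrative assumptions, not part of Ray Serve or Anyscale. It also retries on any exception for brevity; real client code would likely retry only on timeouts and server errors.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay_s=1.0, max_delay_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn() until it succeeds, backing off exponentially between failures.

    fn is expected to enforce its own per-request timeout and raise on failure.
    sleep and rng are injectable so the schedule can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay doubles after each failure (base, 2*base, 4*base, ...)
            # and is capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Sleep a random fraction of the delay ("full jitter") so that
            # many clients do not retry in lockstep.
            sleep(delay * rng())
```

A call such as `call_with_retries(lambda: urllib.request.urlopen("http://localhost:8000/", timeout=10).read())` combines a per-request timeout with backed-off retries. The cap on the delay keeps the worst-case wait bounded even when `max_attempts` is large.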

Server-side timeouts

In addition to client-side retries and timeouts, you can also configure server-side timeouts as a fallback to avoid overloading the service. There are two layers of timeouts that can be configured: in Ray Serve and in the load balancer.

Ray Serve HTTP request timeout


Note: The request_timeout_s field is only supported in Ray 2.6 and later.

To set a timeout for HTTP requests in Ray Serve, you can use the request_timeout_s field of http_options:

http_options:
  request_timeout_s: 10

If the request timeout is reached, Ray Serve returns a 408 (Request Timeout) response code along with the message "Request (request id) timed out after (timeout)s."

Load balancer timeouts

Anyscale sets default timeouts depending on the cloud provider you're running on:

  • On AWS, the ALB idle timeout is set to 300 seconds by default. If no data is transferred over the connection after this duration, the connection will be terminated.
  • On GCP, the backend service timeout is set to 600 seconds by default. If the service doesn't respond to the request after this duration, the connection will be terminated.

These configurations are not currently exposed as service configurations. If you encounter an issue and would like to change them for your services, please reach out to Anyscale support.


If the AWS ALB idle timeout is reached, the load balancer returns a 504 (Gateway Timeout) response code. If the GCP backend service timeout is reached, the load balancer returns a 408 (Request Timeout) response code.
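
Because a timeout can originate in more than one layer, it can help to log which layer most likely fired. The sketch below classifies a response using only the status codes and the Ray Serve message format described in this section. The function name `classify_timeout` and the labels it returns are hypothetical, and the regular expression assumes the response body matches the quoted message template.

```python
import re

# Assumed shape of the Ray Serve timeout message described in this section:
# "Request <request id> timed out after <timeout>s."
_SERVE_TIMEOUT_RE = re.compile(r"Request \S+ timed out after [\d.]+s")


def classify_timeout(status_code, body=""):
    """Return a label for the layer that most likely timed out, or None."""
    if status_code == 504:
        # Gateway Timeout: for example, the AWS ALB idle timeout.
        return "load_balancer"
    if status_code == 408:
        if _SERVE_TIMEOUT_RE.search(body):
            # The Ray Serve request_timeout_s limit was reached.
            return "ray_serve"
        # For example, the GCP backend service timeout.
        return "load_balancer_or_unknown"
    return None
```

The body check is what separates the two sources of a 408, so this only works if the client keeps the response text when logging failures.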

Avoid scheduling actors and tasks on the head node

The Ray head node contains a number of important system processes such as the global control store (GCS), API server, and the Ray Serve controller. If a cluster is under heavy load, actors and tasks running on the head node will contend for resources and may cause these system components to hang or crash. When this happens, the service may be unable to serve traffic or recover from failures properly.

To avoid this issue, prevent Ray from scheduling tasks and actors on the head node by setting its resource availability to zero in your compute config:

name: head_node_type
instance_type: m5.2xlarge
cpu: 0
gpu: 0

Avoid single points of failure

To avoid dropping requests when unexpected failures occur (for example, a replica actor or worker node crashes), it's important to configure your application with redundancy. This is especially important if you configure your clusters to use spot instances with the spot-to-on-demand feature. At a minimum, you should ensure that every deployment in your Ray Serve applications has at least two replicas and those replicas are placed on different nodes in the cluster.

Configuring multiple replicas in Serve deployments

To configure your application to have multiple replicas, set num_replicas >= 2 in the Serve application code or config (or if you're using autoscaling, make sure that min_replicas >= 2). For more details, see the Ray Serve documentation on scaling out a deployment.

from ray import serve

@serve.deployment(num_replicas=2)  # At least two replicas for redundancy
class Deployment:
    ...

Spreading replicas across nodes


We plan to make configuring redundancy easier and more automated in a future Ray release. If you have questions about the right configuration for your service, please reach out to Anyscale support.

By default, Ray Serve tries to spread the replicas of a deployment across the available nodes in the cluster, but in some cases replicas may still be placed on the same node. To ensure that replicas of a deployment are spread across nodes, use the max_replicas_per_node deployment option. See the Ray Serve documentation for more details.
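
A check along these lines could run before each deploy to catch deployments that lack redundancy. This is a minimal sketch that inspects a Serve config already loaded into a Python dict. The helper name `find_redundancy_gaps` and the example application are hypothetical, and the assumed layout (a list of `applications`, each with `deployments` that carry `num_replicas`, `autoscaling_config.min_replicas`, and `max_replicas_per_node`) should be confirmed against the Serve config schema for your Ray version.

```python
def find_redundancy_gaps(serve_config):
    """Return (deployment name, reason) pairs for deployments lacking redundancy."""
    gaps = []
    for app in serve_config.get("applications", []):
        for dep in app.get("deployments", []):
            name = dep.get("name", "<unnamed>")
            autoscaling = dep.get("autoscaling_config")
            if autoscaling is not None:
                # With autoscaling, min_replicas is the floor that matters.
                floor = autoscaling.get("min_replicas", 1)
            else:
                # Assumes num_replicas is an integer when it is present.
                floor = dep.get("num_replicas", 1)
            if floor < 2:
                gaps.append((name, "fewer than 2 replicas"))
            if dep.get("max_replicas_per_node") is None:
                gaps.append((name, "max_replicas_per_node not set"))
    return gaps


config = {
    "applications": [{
        "name": "example_app",  # Hypothetical application
        "deployments": [
            {"name": "Model", "num_replicas": 2, "max_replicas_per_node": 1},
            {"name": "Router", "autoscaling_config": {"min_replicas": 1}},
        ],
    }]
}
print(find_redundancy_gaps(config))
```

In this example only the hypothetical `Router` deployment is flagged, once for its replica floor and once for the missing per-node limit.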

Configuring head node fault tolerance

Ray handles replica and worker node failures gracefully by default, but if the head node crashes or becomes unresponsive, the cluster restarts. To keep serving traffic when this happens, configure head node fault tolerance for your services.