Production best practices

This guide covers best practices for production workloads running on Anyscale Services.

Timeouts and retries

Client-side timeouts and retries

Many things can go wrong when processing a request to your service, from the application code to the Ray cluster to the load balancer. To minimize user-facing disruptions, it's best practice to always configure end-to-end retries from the client.

In addition to retries, it's also important to configure timeouts in your client code for two reasons:

  1. Timeouts avoid hard-to-debug hanging behavior in the calling code. For example, your backend server sees a clear timeout error instead of latency spikes with no obvious source.
  2. Ray Serve doesn't drop requests by default when it's overloaded, so timing out and disconnecting on the client side reduces the load on the service and lets it keep up with inbound traffic. For this reason, client retries should also use exponential back-off to reduce load when the service can't respond in time.

Exactly how to implement retries and timeouts is specific to your client code, but below is an example using the Python requests library:

import requests
from requests.adapters import HTTPAdapter, Retry

session = requests.Session()

retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Exponential back-off
    status_forcelist=[  # Retry on server errors
        500,
        501,
        502,
        503,
        504,
    ],
)

session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("http://localhost:8000/", timeout=10)
result = response.text
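
When retries are exhausted or the timeout fires, the requests library raises an exception instead of hanging. Below is a minimal sketch of handling this in the caller, building on the session above; the fallback behavior is application-specific:

try:
    response = session.get("http://localhost:8000/", timeout=10)
    response.raise_for_status()
    result = response.text
except requests.exceptions.Timeout:
    # The request didn't complete within 10 seconds; surface a clear
    # timeout error instead of letting the caller hang.
    result = None
except requests.exceptions.RequestException:
    # Retries were exhausted or another transport-level error occurred.
    result = None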

Server-side timeouts

In addition to client-side retries and timeouts, you can also configure server-side timeouts as a fallback to avoid overloading the service. There are two layers of server-side timeouts: in Ray Serve and in the cloud's load balancer.

Ray Serve request timeout

To set a timeout for requests in Ray Serve, set request_timeout_s under http_options in the ray_serve_config section of your service configuration:

ray_serve_config:
  http_options:
    request_timeout_s: 10

Load balancer timeouts

Anyscale sets default timeouts depending on the cloud provider you're running on:

  • On AWS, the ALB idle timeout is set to 300 seconds by default. If no data is transferred over the connection after this duration, the connection will be terminated and the client will receive a 504 (Gateway Timeout) response code.
  • On GCP, the backend service timeout is set to 600 seconds by default. If the service doesn't respond to the request after this duration, the connection will be terminated and the client will receive a 408 (Request Timeout) response code.

These configurations are not currently exposed as service configurations. If you encounter an issue and would like to change them for your services, please reach out to Anyscale support.

Load shedding

By default, Ray Serve doesn't drop requests when it's overloaded and relies on timeouts for backpressure. If clients misbehave (for example, they don't enforce their own timeouts), server-side queues can build up and tail latencies can increase under load. To actively drop requests when queues become too long, use the max_queued_requests option exposed by Ray Serve.
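
For example, a deployment can cap its queue as follows; the limit of 10 is illustrative and should be tuned to your workload:

from ray import serve

@serve.deployment(
    max_queued_requests=10,  # Illustrative limit; requests beyond it are rejected instead of queuing
)
class MyApp:
    ...

app = MyApp.bind()

Rejected requests fail fast rather than sitting in the queue, so pair this option with the client-side retries and exponential back-off described above.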

Spreading replicas across nodes

To avoid dropping requests when unexpected failures occur (for example, a replica actor or worker node crashes), it's important to configure your application with redundancy. At a minimum, you should ensure that every deployment in your Ray Serve applications has at least two replicas and those replicas are placed on different nodes in the cluster.

By default, Ray Serve will try to pack replicas onto the same nodes as much as possible to save costs. To ensure replicas of a deployment are spread across nodes, use the max_replicas_per_node deployment option. See the Ray Serve documentation for more details.

Below is a basic example that configures a deployment to autoscale with at least two replicas that will be placed on different nodes. For more details on how to configure scaling options for your applications, see the Ray Serve autoscaling guide.

from ray import serve

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 2,
    },
    max_replicas_per_node=1,
)
class MyApp:
    ...

app = MyApp.bind()

Head node fault tolerance

Ray handles replica and worker node failures gracefully in all clusters, but if the head node crashes or becomes unresponsive, the whole cluster restarts. To continue serving traffic when this happens, configure head node fault tolerance for your services.

Avoid scheduling on the head node

The Ray head node contains a number of important system processes such as the global control store (GCS), API server, and the Ray Serve controller. If a cluster is under heavy load, actors and tasks running on the head node will contend for resources and may interrupt the operation of these system components. When this happens, the service may be unable to serve traffic or recover from failures properly.

To avoid this issue, services don't schedule any actors or tasks on the head node by default. You can override this behavior by explicitly setting available resources in the HeadNodeConfig for your service, but this isn't recommended for production.