
Autoscaling

Autoscaling is a feature that dynamically scales your deployments up or down based on incoming traffic. When the traffic to your deployment increases, autoscaling adds more replicas of your deployment to handle the traffic. When traffic decreases, autoscaling removes replicas from your deployment to save resources and costs.

By default, autoscaling is disabled, and scaling for your deployment is manually controlled by num_replicas. If you want the Serve Autoscaler to automatically scale the number of replicas based on incoming traffic instead of manually controlling it, you can use this guide to learn how to set up autoscaling for your service.

Autoscaling in Anyscale Services has two components: the Ray Autoscaler and the Ray Serve Autoscaler. They control different aspects of autoscaling, and you configure them separately: the Serve autoscaling config for your Ray Serve deployments, and optionally the Anyscale compute config for the cluster. The sections below outline how to configure both components to optimize your service.

Anyscale compute config vs. Serve autoscaling config

Autoscaling uses two separate Autoscalers: the Ray Autoscaler and Ray Serve Autoscaler.

  • The Ray Serve Autoscaler decides when to add or remove Serve deployment replicas based on the number of queued requests. Replicas are copies of the model running as separate Python processes, and you control their number through the autoscaling config for each Ray Serve deployment. The number of deployment replicas that can run on a node depends on the resource requirements you define for that deployment. For example, a deployment that requires 0.5 CPUs per replica can run 16 replicas on an 8-CPU worker node.

  • The Ray Autoscaler decides when to add or remove nodes based on resources requested by the Ray Serve Autoscaler.

  • Upscaling: If the number of queued requests exceeds the target value set in the Serve autoscaling config, the Ray Serve Autoscaler adds new replicas. If the cluster doesn't have enough resources to run the new replicas, the Ray Autoscaler provisions a new node.

  • Downscaling: Similarly, if the number of queued requests is below the target value set in the Serve autoscaling config, the Ray Serve Autoscaler removes replicas. If a node no longer has any replicas after scaling down, the Ray Autoscaler tears it down.
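
The replica-packing arithmetic from the first bullet can be sketched in a few lines. This is an illustrative helper, not a Ray API:

```python
import math

def replicas_per_node(node_cpus: float, cpus_per_replica: float) -> int:
    """How many replicas of one deployment fit on a single worker node."""
    return math.floor(node_cpus / cpus_per_replica)

# A deployment requesting 0.5 CPUs per replica on an 8-CPU worker node:
print(replicas_per_node(8, 0.5))  # 16
```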

Therefore the Ray Autoscaler and Ray Serve Autoscaler work together. The Ray Serve Autoscaler is configured by the Serve autoscaling config, and the Ray Autoscaler is configured through the compute config. Continue reading the next two sections to see how to set the configurations for your use case.

How to set Serve autoscaling config

The Serve autoscaling config defines how the Ray Serve Autoscaler reacts to incoming traffic. This is the configuration that you should tune in order to meet your performance requirements (such as latency or throughput). The autoscaling config defines the following:

  • Target value, which decides what the steady state of your system should look like.
  • Upper and lower autoscaling limits. While the compute config defines the minimum and maximum amount of resources allowed for your system, the Serve autoscaling config defines the minimum and maximum number of copies (called replicas) of your deployment to run.
  • How the system reacts to changes in traffic. Before reaching steady state, your deployment is reacting to traffic shifts, and you can control that using the autoscaling config.

To try out autoscaling, set num_replicas="auto", for example:

ray_serve_config:
  applications:
    - name: default
      deployments:
        - name: Model
          num_replicas: "auto"

This uses a default autoscaling configuration that is a good starting point for most applications. Setting num_replicas="auto" is equivalent to setting:

  • target_ongoing_requests=2
  • max_ongoing_requests=5
  • min_replicas=1
  • max_replicas=100

To learn more about what target_ongoing_requests and max_ongoing_requests do, continue to the next section.

note

You must turn off manual scaling to use autoscaling. In other words, you cannot set an integer value for num_replicas and a non-null value for autoscaling_config at the same time.

target_ongoing_requests

Serve scales the number of replicas for a deployment up or down based on the average number of ongoing requests per replica. Specifically, Serve compares the actual number of ongoing requests per replica with the target value you set in the autoscaling config and makes upscale or downscale decisions from that. The target value is set by target_ongoing_requests, and Serve makes sure that each replica has roughly that number of requests being processed and waiting in the queue.

To start, you can set this to a reasonable number (for example, 5) and adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be). However, always load test your workloads. For example, if the use case is latency sensitive, you can lower the target_ongoing_requests number to maintain high performance. Benchmark your application code and tune this number based on an end-to-end latency objective.

note

As an example, suppose you have 2 replicas of a synchronous deployment with 100 ms latency, serving a traffic load of 30 QPS. Each replica can only serve 10 requests per second, so 3 replicas are needed to handle 30 QPS. In other words, Ray Serve assigns requests to replicas faster than the replicas can finish processing them. Ray Serve queues up more and more requests, called "ongoing requests," at the replicas as time goes on. The average number of ongoing requests at each replica steadily increases, and latency increases too, because new requests have to wait for old requests to finish processing. If you set target_ongoing_requests = 1, Serve detects a higher-than-desired number of ongoing requests per replica and adds more replicas. At 3 replicas, your system can process 30 QPS with 1 ongoing request per replica on average.
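
The estimate in this example follows Little's law: average in-flight requests equal arrival rate times latency. A rough sketch (a back-of-envelope sizing aid, not the autoscaler's actual algorithm):

```python
import math

def replicas_needed(qps: float, latency_s: float,
                    target_ongoing_requests: float) -> int:
    """Little's law: average in-flight requests = qps * latency.
    Spread them so each replica holds ~target_ongoing_requests."""
    return math.ceil(qps * latency_s / target_ongoing_requests)

# 30 QPS of 100 ms requests with target_ongoing_requests = 1:
print(replicas_needed(30, 0.1, 1))  # 3
```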

max_ongoing_requests

Although max_ongoing_requests is not part of the autoscaling config since it is relevant to all deployments, it is still important to set it relative to the target value if autoscaling is turned on for your deployment. max_ongoing_requests defines the maximum queue limit that proxies respect when assigning requests to replicas. We recommend setting max_ongoing_requests to ~20 to 50% higher than target_ongoing_requests. Note that target_ongoing_requests should always be strictly less than max_ongoing_requests, otherwise the deployment never scales up. Take into account the following when setting max_ongoing_requests:

Setting it too low limits upscaling. For instance, if your target value is 50 and max_ongoing_requests is 51, then even if traffic increases significantly, requests queue up at the proxy instead of at the replicas. As a result, the Autoscaler increases the number of replicas by at most 2% at a time, which is very slow.

Setting it too high can lead to imbalanced routing. Concretely, this can lead to very high tail latencies during upscale, because when the Autoscaler is scaling a deployment up due to a traffic spike, most or all of the requests might be assigned to the existing replicas before the new replicas are started.
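
The headroom recommendation and the "2% at a time" cap above can be checked with simple arithmetic (an illustrative sketch, not Serve internals):

```python
import math

target = 50

# Recommended headroom: max_ongoing_requests ~20-50% above the target.
recommended_range = (math.ceil(target * 1.2), math.ceil(target * 1.5))
print(recommended_range)  # (60, 75)

# With max_ongoing_requests = 51, each replica can report at most 51
# ongoing requests, so one upscale decision is capped near 51/50 = 1.02x.
max_upscale_ratio = 51 / target
print(max_upscale_ratio)  # 1.02
```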

Autoscaling parameters

This section describes how each parameter in autoscaling_config affects the Serve autoscaling behavior. An example of how to set the autoscaling config for a specific deployment in your service YAML is as follows:

...
ray_serve_config:
  applications:
    - name: default_application
      ...
      deployments:
        - name: Model
          autoscaling_config:
            target_ongoing_requests: 1
            min_replicas: 0
            max_replicas: 200
            initial_replicas: 0
            upscale_delay_s: 3
            downscale_delay_s: 60
            upscaling_factor: 0.3
            downscaling_factor: 0.3
            metrics_interval_s: 2
            look_back_period_s: 10
          max_ongoing_requests: 3

  • target_ongoing_requests [default=1]: The average number of ongoing requests per replica that the Serve Autoscaler tries to maintain.

  • min_replicas [default=1]: The minimum number of replicas for the deployment. If you want to ensure your system can deal with a certain level of traffic at all times, set min_replicas to a positive number. On the other hand, if you anticipate periods of no traffic and want to scale to zero to reduce costs, set min_replicas = 0. Setting min_replicas = 0 causes higher tail latencies: when you start sending traffic, the deployment scales up, which incurs a cold start as Serve waits for replicas to start before serving the requests. Note that the min_workers value for each worker node type set in the compute config can constrain downscaling the number of replicas, even if the number of replicas hasn't reached the min_replicas limit yet.

  • max_replicas [default=1]: The maximum number of replicas for the deployment. This value should be greater than min_replicas. Ray Serve Autoscaling relies on the Ray Autoscaler to scale up more nodes when the currently available cluster resources (CPUs, GPUs, etc.) are not enough to support more replicas. Note that the max_workers value set in the compute config can constrain upscaling the number of replicas, even if the number of replicas in your application hasn't reached the max_replicas limit yet.

  • initial_replicas: The number of replicas that Ray Serve starts initially for the deployment. Defaults to the value for min_replicas.

  • upscale_delay_s [default=30]: The amount of time in seconds that Serve waits before scaling up the number of replicas in the deployment. In other words, this parameter controls the frequency of upscale decisions. If the replicas are consistently serving more requests than the target number for upscale_delay_s seconds, then Serve scales up the number of replicas based on aggregated ongoing requests metrics. For example, if your service is likely to experience bursts of traffic, you can lower upscale_delay_s so that your application can react more quickly to increases in traffic.

  • downscale_delay_s [default=600]: The amount of time in seconds that Serve waits before scaling down the number of replicas in your deployment. This parameter controls the frequency of downscale decisions. If the replicas are consistently serving fewer requests than the target number for downscale_delay_s seconds, then Serve scales down the number of replicas based on aggregated ongoing requests metrics. For example, if your application initializes slowly, you can increase downscale_delay_s to make downscaling happen less frequently and avoid re-initializing when the application needs to upscale again in the future.

  • upscaling_factor [default=1.0]: The multiplicative factor to amplify or moderate each upscaling decision. For example, when the application has a high traffic volume in a short period of time, you can increase upscaling_factor to scale up resources quickly. This parameter is like a "gain" factor that amplifies the response of the autoscaling algorithm.

  • downscaling_factor [default=1.0]: The multiplicative factor to amplify or moderate each downscaling decision. For example, if you want your application to be less sensitive to drops in traffic and scale down more conservatively, you can decrease downscaling_factor to slow down the pace of downscaling.

  • metrics_interval_s [default=10]: The time interval in seconds between successive reports of current ongoing requests that each replica sends to the Autoscaler. Note that the Autoscaler can't make new decisions if it doesn't receive updated metrics, so you most likely want to set metrics_interval_s to a value that is less than or equal to the upscale and downscale delay values. For instance, if you set upscale_delay_s = 3 but keep metrics_interval_s = 10, the Autoscaler only upscales roughly every 10 seconds.

  • look_back_period_s [default=30]: The window of time over which Ray Serve calculates the average number of ongoing requests per replica.

There is also one more important configuration, max_ongoing_requests, that isn't part of the autoscaling config, but it's important to set it relative to target_ongoing_requests.

  • max_ongoing_requests [default=100]: The maximum queue limit that proxies respect when assigning requests to replicas. Set max_ongoing_requests to ~20 to 50% higher than target_ongoing_requests. Note that max_ongoing_requests should always be strictly higher than target_ongoing_requests. Generally, setting it too low limits upscaling because requests queue at the proxies instead of at the replicas, but setting it too high can lead to imbalanced routing and high tail latencies during upscale.
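
The constraints described above can be encoded as a short sanity check. This helper and its name are purely illustrative, not part of Ray Serve's API:

```python
def validate_autoscaling_config(cfg: dict) -> list:
    """Return a list of problems with an autoscaling config, following
    the guidance in this guide. Illustrative only, not a Ray Serve API."""
    problems = []
    if cfg["min_replicas"] > cfg["max_replicas"]:
        problems.append("min_replicas must not exceed max_replicas")
    if cfg["target_ongoing_requests"] >= cfg["max_ongoing_requests"]:
        # Equal values mean the deployment never scales up.
        problems.append("target_ongoing_requests must be strictly less "
                        "than max_ongoing_requests")
    if cfg.get("metrics_interval_s", 10) > cfg.get("upscale_delay_s", 30):
        problems.append("metrics_interval_s above upscale_delay_s slows "
                        "upscale decisions")
    return problems

cfg = {"min_replicas": 0, "max_replicas": 200,
       "target_ongoing_requests": 1, "max_ongoing_requests": 3,
       "metrics_interval_s": 2, "upscale_delay_s": 3}
print(validate_autoscaling_config(cfg))  # []
```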

How to set compute config

In the context of autoscaling, the compute config defines the minimum and maximum hardware resource limits for the Ray Autoscaler to obey. To enable Ray autoscaling for your service, you should have one or more worker node types that can have a variable number of workers. By default for each worker node type, min_workers = 0 and max_workers = 10, so Ray autoscaling is always enabled by default. However, you can use the compute config to specify stricter resource limits if you need to.

For instance, the following compute config defines two worker node types: cpu_worker and gpu_worker. The number of CPU worker nodes can vary anywhere between 2 and 10. The number of GPU worker nodes can vary anywhere between 0 and 10.

...
head_node_type:
  name: head_node_type
  instance_type: m5.8xlarge
worker_node_types:
  - name: cpu_worker
    instance_type: m5.8xlarge
    min_workers: 2
    max_workers: 10
  - name: gpu_worker
    instance_type: g4dn.4xlarge
    min_workers: 0
    max_workers: 10

Note that the cluster always runs at least min_workers = 2 CPU worker nodes, regardless of the number of Serve replicas or how low traffic is. Likewise, the max_workers parameter in your compute config imposes a hard upper limit on the number of worker nodes, no matter how many replicas the Serve Autoscaler requests.

For more details and examples about how to set the autoscaling config, see the Ray Serve Autoscaling Guide and Ray Serve Advanced Autoscaling Guide.

Replica Compaction

Autoscaling deployments may face resource fragmentation over time as downscaled replicas leave gaps on certain nodes. To save cost, Anyscale offers compact scheduling, which periodically checks for opportunities to compact replicas down to a smaller number of nodes. To enable compact scheduling, set the environment variable RAY_SERVE_USE_COMPACT_SCHEDULING_STRATEGY=1.

Take the following situation as an example. Suppose you have three 8-CPU nodes in your cluster, and the following deployments:

  • Deployment A: 1 CPU per replica
  • Deployment B: 2 CPUs per replica
  • Deployment C: 3 CPUs per replica

And the following replicas running in the cluster:

Node 1 (8/8 CPUs utilized) | Node 2 (8/8 CPUs utilized) | Node 3 (5/8 CPUs utilized)
---------------------------|----------------------------|---------------------------
5 A replicas               | 4 A replicas               | 2 A replicas
1 C replica                | 2 B replicas               | 1 C replica

Nodes 1 and 2 are fully utilized while only 5 of 8 CPUs are utilized on node 3. Then, suppose deployment A downscales from 11 to 8 replicas, and deployment B downscales from 2 to 1 replica. The cluster now looks like this:

Node 1 (6/8 CPUs utilized) | Node 2 (5/8 CPUs utilized) | Node 3 (5/8 CPUs utilized)
---------------------------|----------------------------|---------------------------
3 A replicas               | 3 A replicas               | 2 A replicas
1 C replica                | 1 B replica                | 1 C replica

Now that there are available CPUs in nodes 1 and 2, it's possible for the replicas running on node 3 to be migrated to nodes 1 and 2. With compact scheduling enabled, Serve detects that the running replicas can be compacted down to 2 nodes.
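
The feasibility check behind this decision can be sketched as a toy first-fit bin-packing pass. This is an illustrative model of the idea, not Serve's actual scheduler:

```python
# Free CPUs after downscaling: node1 has 2, node2 has 3 (see the table above).
free = {"node1": 2.0, "node2": 3.0}

# Replicas still on node3: one C replica (3 CPUs), two A replicas (1 CPU each).
to_migrate = [("deploymentC", 3.0), ("deploymentA", 1.0), ("deploymentA", 1.0)]

plan = {}
for name, cpus in sorted(to_migrate, key=lambda r: -r[1]):  # largest first
    # First-fit: pick the first node with enough free CPUs.
    target = next((n for n, f in free.items() if f >= cpus), None)
    if target is None:
        plan = None  # node3 is not compactable
        break
    free[target] -= cpus
    plan.setdefault(target, []).append(name)

print(plan)  # {'node2': ['deploymentC'], 'node1': ['deploymentA', 'deploymentA']}
```

Every replica from node 3 fits into the free space on nodes 1 and 2, so node 3 can be drained.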

INFO deployment_scheduler.py:593 -- Found compactable node 'node3' with migration plan:
{
[Replica(id='1wbyztcn', deployment='deploymentC', app='default')] -> node2,
[Replica(id='nmyumojq', deployment='deploymentA', app='default'), Replica(id='td1gyucw', deployment='deploymentA', app='default')] -> node1
}.

Serve migrates all replicas running on the node to be compacted in a start-then-stop manner: until each replacement replica is fully running and ready to serve traffic, the old replica continues to serve traffic unaffected.

INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='1wbyztcn', deployment='deploymentC', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='nmyumojq', deployment='deploymentA', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='td1gyucw', deployment='deploymentA', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:1955 - Adding 1 replica to Deployment(name='deploymentC', app='default').
INFO controller 29720 deployment_state.py:1955 - Adding 2 replicas to Deployment(name='deploymentA', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='thbjs8h6', deployment='deploymentC', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='uakn0fws', deployment='deploymentA', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='j9j67kvr', deployment='deploymentA', app='default').
INFO controller 29720 deployment_state.py:2086 - Replica(id='thbjs8h6', deployment='deploymentC', app='default') started successfully on node 'node2'.
INFO controller 29720 deployment_state.py:2086 - Replica(id='uakn0fws', deployment='deploymentA', app='default') started successfully on node 'node1'.
INFO controller 29720 deployment_state.py:2086 - Replica(id='j9j67kvr', deployment='deploymentA', app='default') started successfully on node 'node1'.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='1wbyztcn', deployment='deploymentC', app='default') on draining node node3.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='nmyumojq', deployment='deploymentA', app='default') on draining node node3.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='td1gyucw', deployment='deploymentA', app='default') on draining node node3.
INFO controller 29720 proxy_state.py:493 - Draining proxy on node 'node3'.
INFO controller 29720 deployment_scheduler.py:409 - Successfully migrated replicas off of node3.

Finally, this is what the cluster looks like after compaction. After the idle timeout, Serve terminates node 3, and the cluster scales down to 2 nodes.

Node 1 (8/8 CPUs utilized) | Node 2 (8/8 CPUs utilized) | Node 3 (idle)
---------------------------|----------------------------|--------------
3 A replicas               | 3 A replicas               |
2 A replicas               | 1 B replica                |
1 C replica                | 1 C replica                |