Scale a service

Increasing the number of replicas in your deployments, either manually or by configuring autoscaling, can increase the throughput of your service. This guide covers:

  • How to manually scale out your service
  • How to set up autoscaling for your service
  • How to save compute resources by enabling replica compaction

Manual scaling

You can manually scale your service by increasing or decreasing the value of num_replicas in your deployment.

For example, suppose you are serving an ML model with 2 replicas.

from ray import serve

@serve.deployment(num_replicas=2)
class MLModel:
    ...

Now suppose you want to achieve lower latencies for an increased user base. You can redeploy the model with, say, 5 replicas to improve performance:

@serve.deployment(num_replicas=5)
class MLModel:
    ...

By default, num_replicas is set to 1.
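
Below is a minimal runnable sketch of this pattern. The placeholder predict logic and the local serve.run call are illustrative additions, not part of the original example:

from ray import serve

@serve.deployment(num_replicas=2)
class MLModel:
    def __call__(self, request) -> str:
        # Placeholder inference logic; replace with your model's predict call.
        return "prediction"

# Bind and run the application locally; Serve load balances
# incoming requests across the 2 replicas.
serve.run(MLModel.bind())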

Autoscaling

Configure autoscaling to dynamically scale the number of replicas in your deployment(s) up or down in response to incoming traffic. This allows your models to handle variable traffic loads while saving costs on idle compute resources.

Set up autoscaling for a service

To try out autoscaling:

  1. Set num_replicas="auto" in your Serve deployment config. This uses a set of default autoscaling configurations.
  2. Define one or more worker node types in your compute config that can have a variable number of workers. By default, each worker node type has min_workers = 0 and max_workers = 10, so autoscaling is enabled out of the box, but you can adjust these limits to fit your needs.

Serve autoscaling config vs Anyscale compute config
  • The autoscaling config configures how the number of replicas in a Serve deployment upscales or downscales. Serve decides when to add or remove replicas, and how many, based on the traffic load and the autoscaling config.
  • The compute config configures the types of nodes in the cluster, and the minimum and maximum number of nodes the cluster can scale to.

The link between the two is that if Serve decides to add new replicas and the cluster doesn't have enough resources to run them, a new node from the compute config is provisioned (subject to the maximum hardware resource limits defined in the compute config). Similarly, if Serve decides to remove replicas and a node no longer has any replicas after scaling down, that node is torn down (subject to the minimum hardware resource limits).

from ray import serve

@serve.deployment(num_replicas="auto")
class MLModel:
    ...

How autoscaling works

Let's dive deeper into how Serve makes autoscaling decisions for a deployment.

Ray Serve uses request-based autoscaling, meaning it decides whether to increase or decrease the number of replicas, and by how much, by comparing the actual number of ongoing requests per replica with target_ongoing_requests. This value, which is set in the autoscaling_config, should be chosen based on your latency objectives. For example, if your use case is latency sensitive, you can lower target_ongoing_requests to maintain high performance.

The number of requests that get assigned to any single replica is capped by max_ongoing_requests, so max_ongoing_requests should be set relative to target_ongoing_requests. The higher this value is, the more likely it is that requests exceeding the target value queue up at the replicas (as opposed to at the proxies). The benefit of this is that those requests can take advantage of concurrency in your deployment. However, it can also lead to imbalanced routing and higher tail latencies during upscale, because during bursts of traffic a large number of requests can get assigned to overloaded replicas while new replicas aren't assigned any requests.

Serve upscales or downscales to the newly determined replica target, but keeps the number of running replicas between min_replicas and max_replicas.
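
For example, a latency-sensitive deployment might lower target_ongoing_requests and keep max_ongoing_requests at a small multiple of it. The sketch below uses illustrative values, not tuned recommendations:

from ray import serve

@serve.deployment(
    num_replicas="auto",
    max_ongoing_requests=5,  # cap on requests assigned to a single replica
    autoscaling_config={
        "target_ongoing_requests": 2,  # autoscaler aims for ~2 ongoing requests per replica
    },
)
class MLModel:
    ...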

Autoscaling config parameters

Beyond defining the target number of ongoing requests per replica, which largely controls how your deployment performs in steady state, there are more autoscaling parameters that give finer control over how your deployment reacts to changes in traffic. For instance, you can change the time window over which Serve averages request metrics, or limit how often autoscaling decisions are made by adjusting the upscale/downscale delays.

from ray import serve

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "upscale_delay_s": 60,
    },
)
class MLModel:
    ...
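
As a slightly fuller sketch, the config below also overrides the metrics averaging window and the downscale delay. The parameter names come from the Ray Serve autoscaling_config; the values are placeholders to show the shape of the config, not recommendations:

from ray import serve

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "look_back_period_s": 30,   # window over which request metrics are averaged
        "upscale_delay_s": 60,      # wait this long under sustained load before upscaling
        "downscale_delay_s": 300,   # wait longer before downscaling to avoid thrashing
    },
)
class MLModel:
    ...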

For a full list of autoscaling config parameters, details about how each of them affects the autoscaling behavior, and more example autoscaling applications, see the Ray Serve Autoscaling Guide and Ray Serve Advanced Autoscaling Guide.

Save resources with replica compaction

As deployments scale up and down, they can face resource fragmentation over time when downscaled replicas leave gaps on certain nodes. To save cost, Anyscale offers replica compaction, which periodically checks for opportunities to compact replicas down to a smaller number of nodes. To enable replica compaction, set the environment variable RAY_SERVE_USE_COMPACT_SCHEDULING_STRATEGY=1.

Take the following situation as an example. Suppose you have three 8-CPU nodes in your cluster, and the following deployments:

  • Deployment A: 1 CPU per replica
  • Deployment B: 2 CPUs per replica
  • Deployment C: 3 CPUs per replica

And the following replicas running in the cluster:

Node 1 (8/8 CPUs utilized)    Node 2 (8/8 CPUs utilized)    Node 3 (5/8 CPUs utilized)
5 A replicas                  4 A replicas                  2 A replicas
1 C replica                   2 B replicas                  1 C replica

Then, suppose deployment A downscales from 11 to 8 replicas, and deployment B downscales from 2 to 1 replica. The cluster now looks like this:

Node 1 (6/8 CPUs utilized)    Node 2 (5/8 CPUs utilized)    Node 3 (5/8 CPUs utilized)
3 A replicas                  3 A replicas                  2 A replicas
1 C replica                   1 B replica                   1 C replica

Now that there are available CPUs on nodes 1 and 2, it's possible for the replicas running on node 3 to be migrated to nodes 1 and 2. With replica compaction enabled, Serve detects that the running replicas can be compacted down to 2 nodes.

INFO deployment_scheduler.py:593 -- Found compactable node 'node3' with migration plan:
{
[Replica(id='1wbyztcn', deployment='deploymentC', app='default')] -> node2,
[Replica(id='nmyumojq', deployment='deploymentA', app='default'), Replica(id='td1gyucw', deployment='deploymentA', app='default')] -> node1
}.

All replicas running on the node to be compacted are migrated safely in a start-then-stop manner. Until each replacement replica is fully running and ready to serve traffic, the old replica is unaffected and continues to serve traffic.

INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='1wbyztcn', deployment='deploymentC', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='nmyumojq', deployment='deploymentA', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:2345 - Migrating Replica(id='td1gyucw', deployment='deploymentA', app='default') from draining node 'node3'. A new replica will be created on another node.
INFO controller 29720 deployment_state.py:1955 - Adding 1 replica to Deployment(name='deploymentC', app='default').
INFO controller 29720 deployment_state.py:1955 - Adding 2 replicas to Deployment(name='deploymentA', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='thbjs8h6', deployment='deploymentC', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='uakn0fws', deployment='deploymentA', app='default').
INFO controller 29720 deployment_state.py:411 - Starting Replica(id='j9j67kvr', deployment='deploymentA', app='default').
INFO controller 29720 deployment_state.py:2086 - Replica(id='thbjs8h6', deployment='deploymentC', app='default') started successfully on node 'node2'.
INFO controller 29720 deployment_state.py:2086 - Replica(id='uakn0fws', deployment='deploymentA', app='default') started successfully on node 'node1'.
INFO controller 29720 deployment_state.py:2086 - Replica(id='j9j67kvr', deployment='deploymentA', app='default') started successfully on node 'node1'.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='1wbyztcn', deployment='deploymentC', app='default') on draining node node3.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='nmyumojq', deployment='deploymentA', app='default') on draining node node3.
INFO controller 29720 deployment_state.py:2374 - Stopping Replica(id='td1gyucw', deployment='deploymentA', app='default') on draining node node3.
INFO controller 29720 proxy_state.py:493 - Draining proxy on node 'node3'.
INFO controller 29720 deployment_scheduler.py:409 - Successfully migrated replicas off of node3.

Finally, this is what the cluster looks like after the compaction process. After the idle timeout, node 3 is terminated and the cluster scales down to 2 nodes.

Node 1 (8/8 CPUs utilized)    Node 2 (8/8 CPUs utilized)    Node 3 (idle)
3 A replicas                  3 A replicas
2 A replicas                  1 B replica
1 C replica                   1 C replica