Tune parameters for LLMs on Anyscale services

This page explains how to tune vLLM and Ray Serve parameters for your workload, including memory management, inference optimization, and autoscaling for long-context, high-concurrency, or large-model deployments.

Memory management parameters in vLLM

To deploy vLLM successfully, you must fit your model and KV cache within available GPU memory. Use these parameters to control memory allocation and performance optimization.

note

For complete vLLM engine parameter reference, see the vLLM documentation.

For detailed memory calculations, see Choose a GPU for LLM serving.

| Parameter | Description | Best practices |
|---|---|---|
| tensor_parallel_size | Splits transformer layers across GPUs on the same node (horizontal sharding). | Set to the number of GPUs when your model exceeds single-GPU capacity. Requires a high-speed interconnect. |
| pipeline_parallel_size | Vertically partitions model layers across nodes. | Combine with tensor_parallel_size for models requiring more than 8 GPUs. Keep at 1 for single-node models. |
| max_model_len | Maximum context length (prompt + generated tokens) for memory pre-allocation. | Reduce for short-form tasks to save memory. Leave unset to use the model's native context window. |
| gpu_memory_utilization | Fraction (0.0-1.0) of GPU memory for the model and KV cache. | Default: 0.9. Lower it when other processes share the GPU. Raise to 0.95 for more KV cache space. |
| max_num_seqs | Maximum concurrent sequences in the active batch. | Upper limit only (no pre-allocation). Set slightly above your estimated concurrency. Avoid extreme values to prevent OOM. |
warning

Important: max_num_seqs is an upper limit, not actual concurrency. vLLM determines the true concurrency dynamically based on available KV cache.

  • Too low (1-2): Underutilizes the GPU.
  • Too high (2,000+): Causes OOM errors.
  • Recommended: 128-256 for most workloads.

Check your deployment logs for actual concurrency values.
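
The following sketch shows one way to pass these parameters to a Ray Serve LLM deployment through engine_kwargs, assuming the ray.serve.llm API available in recent Ray releases. The model ID, accelerator type, and parameter values are illustrative assumptions, not recommendations for a specific model.

```python
# Minimal sketch, assuming the ray.serve.llm API; model ID, accelerator type,
# and parameter values below are illustrative, not tuned recommendations.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-llm",                                # hypothetical served model name
        model_source="meta-llama/Llama-3.1-8B-Instruct",  # example Hugging Face model
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(
        tensor_parallel_size=1,       # model fits on a single GPU
        pipeline_parallel_size=1,     # single node
        max_model_len=8192,           # cap context length to bound KV cache growth
        gpu_memory_utilization=0.9,   # default; raise only if you need more cache space
        max_num_seqs=256,             # upper limit, not guaranteed concurrency
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```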

Optimization scenarios

Scenario 1: Optimize for high concurrency

  • Goal: Serve a large number of simultaneous users to maximize throughput.
  • Constraint: You must support a large concurrency value.
  • Trade-offs: To support many sequences, each one must consume as little memory as possible, which means limiting the context length. If you can't reduce the context length, you need to expand the GPU memory pool by raising gpu_memory_utilization or adding GPUs.

The following table explains which parameters to tune:

| Parameter | Explanation |
|---|---|
| max_model_len | This is the most effective way to save memory. If your workload involves short Q&A or chat, setting a lower, realistic context length (for example, 1024 or 2048) dramatically shrinks the memory footprint of each sequence. This allows more sequences to fit in a batch, increasing throughput. |
| gpu_memory_utilization | If your current concurrency is close to your target but just slightly under, consider increasing this value from the default 0.9 to 0.95. This allocates more GPU memory to the KV cache, allowing you to support more concurrent sequences. |
| tensor_parallel_size and pipeline_parallel_size | If you need both high concurrency and a moderately long context, your only remaining option is to add more hardware. Using more GPUs increases the total memory available for the KV cache, accommodating more sequences. |
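
As a rough sketch, a short-form chat workload tuned for concurrency might use engine settings like the following; the values are assumptions to adapt to your own traffic, not measured recommendations.

```python
# High-concurrency sketch (illustrative values): trade context length for batch size.
# These settings drop into the engine_kwargs of an LLMConfig like the earlier example.
engine_kwargs = dict(
    max_model_len=2048,           # short Q&A/chat; shrinks each sequence's KV cache
    gpu_memory_utilization=0.95,  # squeeze out extra KV cache space
    max_num_seqs=256,             # cap slightly above the concurrency you expect
)
```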

Scenario 2: Load a very large model

  • Goal: Load a model that's too big for a single GPU.
  • Constraint: The model weights themselves are the source of memory pressure, even before processing requests.
  • Trade-offs: You must deploy more hardware. Tuning other parameters has no impact unless the cluster can load the model.

The following table explains which parameters to tune:

| Parameter | Explanation |
|---|---|
| tensor_parallel_size | Set to the number of available GPUs on the node (for example, 2, 4, or 8) to shard the model's layers. |
| pipeline_parallel_size | For models that don't fit on a single 8-GPU node, you must also use pipeline parallelism to shard the model vertically across nodes. |
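
For illustration, a hypothetical model too large for one 8-GPU node could be sharded across two nodes as follows; the GPU counts are assumptions about the cluster shape.

```python
# Large-model sketch (illustrative GPU counts): shard weights before tuning anything else.
engine_kwargs = dict(
    tensor_parallel_size=8,    # shard layers across the 8 GPUs on each node
    pipeline_parallel_size=2,  # split the layer stack across 2 nodes (16 GPUs total)
)
```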

Scenario 3: Optimize for long context

  • Goal: Process requests with very long context windows.
  • Constraint: You can't significantly reduce max_model_len.
  • Trade-offs: To support a few memory-heavy long sequences, you must choose between reducing concurrency or using more GPUs to increase the total memory available for the KV cache and accommodate longer sequences.

The following table explains which parameters to tune:

| Parameter | Explanation |
|---|---|
| gpu_memory_utilization | If you're just short of supporting your desired context length, increase this value from the default 0.9 to 0.95. This dedicates more GPU memory to the KV cache, enabling support for longer sequences. |
| tensor_parallel_size and pipeline_parallel_size | If raising gpu_memory_utilization to 0.95 and lowering max_num_seqs to 1 still results in OOM errors, you must add more GPUs. Increase tensor parallelism or pipeline parallelism to expand the total GPU memory pool for the large KV cache that long contexts require. |
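
A possible long-context configuration, with assumed values, might look like the following sketch; the context length, concurrency cap, and GPU count depend entirely on your model and hardware.

```python
# Long-context sketch (illustrative values): keep max_model_len, give the KV cache the rest.
engine_kwargs = dict(
    max_model_len=131072,         # the long context you must support (assumed requirement)
    gpu_memory_utilization=0.95,  # dedicate more memory to the KV cache
    max_num_seqs=8,               # accept low per-replica concurrency
    tensor_parallel_size=4,       # add GPUs if this still hits OOM
)
```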

Ray Serve autoscaling configuration

Autoscaling enables your deployment to dynamically adjust replicas based on traffic patterns, which optimizes performance and cost.

note

For detailed autoscaling documentation, see the Ray Serve autoscaling guide.

Replica count limits

| Parameter | Description | Best practices |
|---|---|---|
| min_replicas | Minimum replica count. | Set to 0 for cost optimization (with a cold-start trade-off). Set to >0 for warm standby. |
| max_replicas | Maximum replica count. | Must exceed min_replicas. Size for peak traffic plus headroom. Verify cluster capacity. |
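
For example, the replica limits can sit in the deployment's autoscaling configuration; the sketch below assumes the deployment_config shape used by Ray Serve LLM, and the counts are illustrative.

```python
# Replica-limit sketch (illustrative counts) for an LLMConfig deployment_config.
deployment_config = dict(
    autoscaling_config=dict(
        min_replicas=1,  # keep one warm replica; 0 trades cold starts for lower cost
        max_replicas=8,  # size for peak traffic plus headroom; verify GPU capacity
    ),
)
```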

Steady-state targets

| Parameter | Description | Best practices |
|---|---|---|
| max_ongoing_requests | Maximum in-flight requests per replica. | Set equal to max_num_seqs. Keep it 20-50% above target_ongoing_requests. |
| target_ongoing_requests | Target average in-flight requests. | Lower for latency-sensitive apps. Raise for throughput-oriented workloads. Validate with load testing. |
note

By default, Ray Serve LLM may set max_ongoing_requests and target_ongoing_requests to very large numbers (for example, 1e9). Use the following guidelines to set appropriate values for effective autoscaling:

  1. Find max concurrency: When you deploy, check the logs for the maximum concurrency supported by the KV cache

    • For example, with context length 8192, logs might show a max concurrency of ~47.
  2. Estimate effective batch size: Based on your average request token length, calculate the number of requests that can be handled concurrently.

    • For example, if the model supports 47 sequences at 8192 tokens each, but your average input request is ~1000 tokens, the effective batch size is approximately (8192 / 1000) * 47 ≈ 385 requests.
  3. Set hard caps: Set the max_num_seqs parameter in vLLM and the max_ongoing_requests parameter in Ray Serve to the calculated value.

    • For example, 385.
  4. Set autoscaling target: Set target_ongoing_requests to be 20-50% lower than the hard cap to provide a buffer for scaling.

    • For example, 300.
  5. Observe and adjust: Monitor your application's autoscaling behavior under load and fine-tune these parameters to meet your performance and cost requirements.
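
The sizing steps above can be captured in a short sketch. The numbers reuse the example values from this page, and the field names assume the Ray Serve LLM deployment_config shape; treat the result as a starting point to validate under load.

```python
# Worked sketch of the sizing steps, using the example numbers from this page.
max_concurrency_from_logs = 47   # step 1: reported by vLLM for context length 8192
avg_request_tokens = 1000        # step 2: your observed average request length
effective_batch = int((8192 / avg_request_tokens) * max_concurrency_from_logs)  # ~385

engine_kwargs = dict(max_num_seqs=effective_batch)  # step 3: vLLM hard cap
deployment_config = dict(
    max_ongoing_requests=effective_batch,           # step 3: Ray Serve hard cap
    autoscaling_config=dict(
        target_ongoing_requests=300,                # step 4: ~20% below the hard cap
    ),
)
```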

Traffic reaction parameters

| Parameter | Description | Best practices |
|---|---|---|
| upscale_delay_s | Delay before scaling up (default: 30s). | Lower for faster burst response (might cause oscillations). |
| downscale_delay_s | Delay before scaling down (default: 600s). | Raise to maintain warm capacity. Lower to prioritize cost savings. |
| upscaling_factor | Scale-up multiplier (default: 1.0). | >1 for aggressive scaling, <1 for conservative scaling. |
| downscaling_factor | Scale-down multiplier (default: 1.0). | <1 for gentle downscaling, >1 for aggressive cost reduction. |
| metrics_interval_s | Metric reporting frequency (default: 10s). | Keep ≤ the delay values. Lower for tighter control. |
| look_back_period_s | Averaging window (default: 30s). | Shorter for bursty workloads, longer for smoothing. |
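
As a reference sketch, these parameters live in the same autoscaling_config; the values below are the listed defaults, with comments noting which direction to move each one.

```python
# Traffic-reaction sketch: defaults shown, with the tuning direction in comments.
autoscaling_config = dict(
    upscale_delay_s=30.0,     # lower for faster burst response (risk: oscillation)
    downscale_delay_s=600.0,  # raise to hold warm capacity, lower to cut cost sooner
    upscaling_factor=1.0,     # >1 scales up more aggressively
    downscaling_factor=1.0,   # <1 scales down more gently
    metrics_interval_s=10.0,  # keep at or below the delay values
    look_back_period_s=30.0,  # shorten for bursty traffic, lengthen to smooth
)
```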

Troubleshooting common autoscaling issues

| Issue | Solution |
|---|---|
| Replica oscillation | Reduce upscaling_factor and downscaling_factor. Match look_back_period_s to your delay settings. |
| Latency spikes during bursts | Reduce upscale_delay_s, increase upscaling_factor, and ensure metrics_interval_s ≤ upscale_delay_s. |
| Premature scale-down | Increase downscale_delay_s and reduce downscaling_factor for gentler scaling. |