You can change the autoscaling behavior, hardware resources, and default messages of an Anyscale Private Endpoint by editing its configuration in the console. Most users only need to tweak settings in Express Configuration; for more granular control, modify the YAML directly in Advanced Configuration.
Specify the GPU type for each deployment as well as the autoscaling behavior for the endpoint.
- Type of GPU used to serve the LLMs.
- Number of replicas at the start of the deployment.
- Minimum number of replicas to maintain. Set this to 0 if you expect long periods of no traffic, to save on cost.
- Maximum number of replicas per deployment.
For enhanced customizability, modify the four main sections in the model YAML:
The `deployment_config` section corresponds to the Ray Serve configuration. It specifies how to auto-scale the model through `autoscaling_config` and how to customize options for model deployments through `ray_actor_options`. It's recommended to use the default values for `downscale_delay_s` and `upscale_delay_s`. These are the configuration options you may want to modify:
`min_replicas`, `initial_replicas`, `max_replicas`: Minimum, initial, and maximum number of model replicas to deploy on a Ray cluster.
`max_concurrent_queries`: Maximum number of queries each Ray Serve replica can handle simultaneously. Excess queries queue at the proxy.
`target_num_ongoing_requests_per_replica`: Guides auto-scaling behavior. Ray Serve scales up the replicas if the average ongoing request count exceeds this number, and scales down if it's lower. This is typically around 40% of `max_concurrent_queries`.
`ray_actor_options`: Similar to the `resources_per_worker` configuration in the `scaling_config` section.
`smoothing_factor`: Controls the pace of scaling decisions. Values below 1.0 slow down the scaling process. See the advanced auto-scaling guide for more details.
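Taken together, a `deployment_config` fragment using these options might look like the following sketch. The numbers are illustrative placeholders, not tuned recommendations:

```yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    # Scale up when average ongoing requests per replica exceed this value,
    # typically ~40% of max_concurrent_queries.
    target_num_ongoing_requests_per_replica: 24
    smoothing_factor: 0.6         # below 1.0 slows scaling decisions
  max_concurrent_queries: 64      # excess queries queue at the proxy
```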
The `engine_config` section manages interactions with a model, including its scheduling and execution.
`model_id`: Model ID within RayLLM or the OpenAI API.
`type`: Inference engine type; only `VLLMEngine` is currently supported.
`engine_kwargs`, `max_total_tokens`: Various configuration options for the inference engine, such as GPU memory utilization, quantization, and the maximum number of concurrent sequences. These options vary based on the hardware accelerator and model size. RayLLM's configuration files offer tuned parameters for reference.
`generation`: Contains default parameters for generation, such as `prompt_format` and `stopping_sequences`.
`hf_model_id`: The Hugging Face model ID; defaults to `model_id` if not specified.
`runtime_env`: Ray's runtime environment settings, allowing specific pip packages and environment variables per model. See the Ray documentation on Runtime Environments for more information.
`s3_mirror_config`, `gcs_mirror_config`: Configurations for loading models from S3 or Google Cloud Storage, respectively, instead of the Hugging Face Hub, to expedite downloads.
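A sketch of an `engine_config` combining these fields follows. The model IDs, token, and bucket URI are placeholders, and the exact mirror-config field names should be checked against the example model configurations:

```yaml
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf   # ID exposed via RayLLM / OpenAI API
  hf_model_id: meta-llama/Llama-2-7b-chat-hf
  type: VLLMEngine
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: "<your-token>"     # required for gated models
  s3_mirror_config:
    bucket_uri: s3://my-model-mirror/llama-2-7b/ # placeholder bucket
```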
RayLLM supports continuous batching, meaning it processes incoming requests as they arrive and adds them to batches that are already being processed. As a result, the model isn't slowed down when some requests take longer to generate than others. It also supports model quantization, allowing the deployment of compressed models on less resource-intensive hardware. See the quantization guide for more details.
The `scaling_config` section specifies the resources required for serving the model, corresponding to Ray's `ScalingConfig`. Note that these settings apply to each model replica, not the entire model deployment.
`num_workers`: Number of workers (Ray Actors) per model replica, controlling tensor parallelism.
`num_gpus_per_worker`: Number of GPUs allocated per worker. This should always be 1.
`num_cpus_per_worker`: Number of CPUs per worker. Usually set to 8.
`placement_strategy`: Ray supports different placement strategies to guide the physical distribution of workers. Use `"STRICT_PACK"` to ensure all workers are on the same node.
`resources_per_worker`: Sets Ray custom resources to assign models to specific node types. For example, set `accelerator_type:L4` to 0.001 for a Llama-2-7b model to deploy it on an L4 GPU. The `num_gpus_per_worker` configuration, along with the number of GPUs available on the node, determines the number of workers Ray schedules on the node. The supported accelerator types are T4, L4, A10G, A100-40G, and A100-80G.
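As a sketch, a `scaling_config` for a model served with tensor parallelism 2 on A10G nodes might look like this; all values are illustrative:

```yaml
scaling_config:
  num_workers: 2                      # tensor parallelism degree
  num_gpus_per_worker: 1              # keep at 1
  num_cpus_per_worker: 8
  placement_strategy: "STRICT_PACK"   # co-locate all workers on one node
  resources_per_worker:
    "accelerator_type:A10G": 0.001    # pins replicas to A10G nodes
```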
Model compute configuration
The 'compute config' section specifies different hardware types for deployment. While you can add more instances, it's important to match the tensor parallelism (`num_workers`) to the hardware capabilities. For instance, a tensor parallelism of 2 isn't feasible on a machine with only 1 GPU.
In advanced configurations, you might notice entries like `accelerator_type_a10: 0.01`. This setting helps Ray assign and scale the model replica onto the correct hardware. It's a mechanism for specifying physical versus logical resources: a value like 0.01 doesn't represent a fraction of the resource but acts as a flag for Ray's resource scheduler. For a deeper understanding, refer to Ray's documentation on physical versus logical resources.
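For instance, in a hypothetical fragment like the one below, the tiny value functions purely as a scheduling flag: Ray places the worker only on nodes that advertise the matching custom resource, and 0.01 is never interpreted as one percent of a GPU:

```yaml
resources_per_worker:
  accelerator_type_a10: 0.01   # scheduler flag, not a fractional allocation
```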
Debugging deployment issues
Deployment issues may arise from several causes:
- An incorrect model ID specification.
- Gated Hugging Face models, such as the Llama family of models, which require setting the `HUGGING_FACE_HUB_TOKEN` environment variable cluster-wide, either in the Ray cluster configuration or before running `serve run`.
- Memory shortages, often indicated by "CUDA", "memory", and "NCCL" errors in replica logs or in `serve run` output. Reducing `max_batch_total_tokens` might resolve these issues. See the example model configurations for valid YAML templates.
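For the gated-model case, one hedged sketch is to export the token in the shell that launches the service; the token value below is a placeholder:

```shell
# Make the Hugging Face token available before starting the app so gated
# models (e.g., the Llama family) can be downloaded.
export HUGGING_FACE_HUB_TOKEN="hf_xxxxxxxx"   # placeholder token
echo "token set: ${HUGGING_FACE_HUB_TOKEN:+yes}"
```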
For broader debugging, the Ray Dashboard serves as a tool for monitoring applications and accessing Ray logs.