Model configuration
You can change the default autoscaling behavior, hardware resources, and default messages of an Anyscale Private Endpoint by changing its configuration in the console. Most users only need to tweak settings in Express configuration. If you want more granular control, modify the YAML directly in Advanced configuration.
Express configuration
Specify the GPU type for each deployment as well as the autoscaling behavior for the endpoint.
| Parameter | Description |
|---|---|
| Accelerator type | Type of GPU to serve the LLMs. |
| Initial replicas | Number of replicas at the start of the deployment. Default: 1. |
| Min replicas | Minimum number of replicas to maintain. Set this to 0 if you expect long periods of no traffic, to save on cost. Default: 1. |
| Max replicas | Maximum number of replicas per deployment. Default: 2. |
Advanced configuration
For enhanced customizability, modify the four main sections in the model YAML: `deployment_config`, `engine_config`, `scaling_config`, and `model_compute_config`.
Deployment configuration
The `deployment_config` section corresponds to the Ray Serve configuration. It specifies how to auto-scale the model through `autoscaling_config` and how to customize options for model deployments through `ray_actor_options`.
It's recommended to use the default values for `metrics_interval_s`, `look_back_period_s`, `smoothing_factor`, `downscale_delay_s`, and `upscale_delay_s`. These are the configuration options you may want to modify:

- `min_replicas`, `initial_replicas`, `max_replicas`: Minimum, initial, and maximum number of model replicas to deploy on a Ray cluster.
- `max_concurrent_queries`: Maximum number of queries each Ray Serve replica can handle simultaneously. Excess queries queue at the proxy.
- `target_num_ongoing_requests_per_replica`: Guides auto-scaling behavior. Ray Serve scales up the replicas if the average number of ongoing requests exceeds this value, and scales down if it's lower. This is typically around 40% of `max_concurrent_queries`.
- `ray_actor_options`: Similar to `resources_per_worker` in the `scaling_config` section.
- `smoothing_factor`: Influences the pace of scaling decisions. Values below 1.0 decelerate the scaling process. See the advanced auto-scaling guide for more details.
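For orientation, here's a minimal sketch of a `deployment_config` block assembled from the fields above. The values are illustrative rather than tuned recommendations, so compare them against the example model configurations before using them:

```yaml
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 24   # roughly 40% of max_concurrent_queries
    metrics_interval_s: 10.0                      # keep the defaults for these timing fields
    look_back_period_s: 30.0
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 64                      # excess queries queue at the proxy
  ray_actor_options:
    resources:
      accelerator_type_a10: 0.01                  # logical-resource flag; see Model compute configuration
```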
Engine configuration
The `engine_config` section manages interactions with a model, including its scheduling and execution.

- `model_id`: Model ID within RayLLM or the OpenAI API.
- `type`: Inference engine type; only `VLLMEngine` is supported.
- `engine_kwargs`, `max_total_tokens`: Various configuration options for the inference engine, such as GPU memory utilization, quantization, and the maximum number of concurrent sequences. These options vary based on the hardware accelerator and model size. RayLLM's configuration files offer tuned parameters for reference.
- `enable_lora`: Boolean flag to enable multi-LoRA serving (details here).
- `generation`: Contains default parameters for generation, like `prompt_format` and `stopping_sequences`.
- `hf_model_id`: The Hugging Face model ID; defaults to `model_id` if unspecified.
- `runtime_env`: Ray's runtime environment settings, allowing specific pip packages and environment variables per model. See the Ray documentation on Runtime Environments for more information.
- `s3_mirror_config` and `gcs_mirror_config`: Configurations for loading models from S3 or Google Cloud Storage, respectively, to expedite downloads instead of pulling from the Hugging Face Hub.
RayLLM supports continuous batching, meaning it processes incoming requests as they arrive and adds them to batches that are already being processed, so the model isn't held up by sequences that take longer to generate than others. It also supports model quantization, allowing the deployment of compressed models on less resource-intensive hardware. See the quantization guide for more details.
Scaling configuration
The `scaling_config` section specifies the resources required for serving the model, corresponding to Ray's ScalingConfig. Note that these settings apply to each model replica, not the entire model deployment.

- `num_workers`: Number of workers (Ray actors) per model replica, controlling tensor parallelism.
- `num_gpus_per_worker`: Number of GPUs allocated per worker. This should always be 1.
- `num_cpus_per_worker`: Number of CPUs per worker. Usually set to 8.
- `placement_strategy`: Ray supports different placement strategies to guide the physical distribution of workers. Use `"STRICT_PACK"` to ensure all workers are on the same node.
- `resources_per_worker`: Sets Ray custom resources to assign models to specific node types. For example, always set `accelerator_type:L4` to 0.001 for a Llama-2-7b model for deployment on an L4 GPU. The `num_gpus_per_worker` configuration, along with the number of GPUs available on the node, determines the number of workers Ray schedules on the node. The supported accelerator types are T4, L4, A10G, A100-40G, and A100-80G.
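For example, a replica intended for a single L4 GPU might use a `scaling_config` along these lines (a sketch assembled from the descriptions above):

```yaml
scaling_config:
  num_workers: 1                     # tensor parallelism degree
  num_gpus_per_worker: 1             # always 1
  num_cpus_per_worker: 8
  placement_strategy: "STRICT_PACK"  # keep all workers of a replica on the same node
  resources_per_worker:
    "accelerator_type:L4": 0.001     # custom resource that pins the replica to L4 nodes
```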
Model compute configuration
The `model_compute_config` section specifies the hardware types available for deployment. While you can add more instances, it's important to match the tensor parallelism (`num_workers`) to the hardware capabilities. For instance, a tensor parallelism of 2 isn't feasible on a machine with only one GPU.
In advanced configurations, you might notice entries like `accelerator_type_a10: 0.01`. This setting helps Ray assign and scale the model replica onto the correct hardware. It's a mechanism for specifying physical versus logical resources: a value like 0.01 doesn't represent a fraction of the resource but acts as a flag for Ray's resource scheduler. For a deeper understanding, refer to Ray's documentation on physical versus logical resources.
Serving LoRA weights
Under the `engine_config.engine_kwargs` block of the Advanced config section, you can change `enable_lora: false` to `enable_lora: true` to enable multi-LoRA serving, where LoRA weights are fetched dynamically at query time.
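The relevant fragment of the Advanced config looks roughly like this (other `engine_config` fields omitted):

```yaml
engine_config:
  engine_kwargs:
    enable_lora: true   # fetch LoRA weights dynamically at query time
```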
Note: Model IDs follow the format `{base_model_id}:{suffix}:{id}`, such as `meta-llama/Llama-2-7b-chat-hf:sql:rhmxn`.
You can inspect all your LoRA weights at `${ANYSCALE_ARTIFACT_STORAGE}/lora_fine_tuning`, where `ANYSCALE_ARTIFACT_STORAGE` is an environment variable. In addition, you can continue to query the base model ID, such as `meta-llama/Llama-2-7b-chat-hf`, while `enable_lora` is set to `true`.
Debugging deployment issues
Deployment issues may arise from several causes:
- An incorrect model ID specification.
- Gated Hugging Face models, such as the Llama family of models, which require setting `HUGGING_FACE_HUB_TOKEN` cluster-wide, either in the Ray cluster configuration or before running `serve run` (see the sketch after this list).
- Memory shortages, often indicated by "CUDA", "memory", and "NCCL" errors in replica logs or `serve run` output. Reducing `max_batch_prefill_tokens` and `max_batch_total_tokens` might resolve these issues. See example model configurations for valid YAML templates.
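For the gated-model case, one option is to pass the token through the model's `runtime_env` so it reaches every replica; this is a sketch (supply your own token), and setting the variable cluster-wide in the Ray cluster configuration works as well:

```yaml
engine_config:
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_xxx   # token with access to the gated repository
```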
For broader debugging, the Ray Dashboard serves as a tool for monitoring applications and accessing Ray logs.