Model configuration

You can change the default autoscaling behavior, hardware resources, and default messages of an Anyscale Private Endpoint by changing its configuration in the console. Most users only need to tweak the settings in Express Configuration. If you want more granular control, modify the YAML directly in Advanced Configuration.

Express configuration

Specify the GPU type for each deployment as well as the autoscaling behavior for the endpoint.

  • Accelerator type: Type of GPU used to serve the LLMs.
  • Initial replicas: Number of replicas at the start of the deployment. Default: 1.
  • Min replicas: Minimum number of replicas to maintain. Set this to 0 if you expect long periods of no traffic, to save on cost. Default: 1.
  • Max replicas: Maximum number of replicas per deployment. Default: 2.
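
These express settings roughly correspond to fields in the Advanced Configuration YAML described below. The following is a simplified, illustrative sketch of that mapping; the exact field placement in your generated config may differ:

```yaml
# Rough mapping of express settings onto the advanced YAML (illustrative only).
deployment_config:
  autoscaling_config:
    min_replicas: 1        # Min replicas
    initial_replicas: 1    # Initial replicas
    max_replicas: 2        # Max replicas
scaling_config:
  resources_per_worker:
    "accelerator_type:L4": 0.001   # Accelerator type (here, an L4 GPU)
```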

Advanced configuration

For enhanced customizability, modify the four main sections in the model YAML: deployment_config, engine_config, scaling_config, and model_compute_config.

Deployment configuration

The deployment_config section corresponds to the Ray Serve configuration. It specifies how to auto-scale the model through autoscaling_config and customize options for model deployments through ray_actor_options.

It's recommended to use the default values for metrics_interval_s, look_back_period_s, smoothing_factor, downscale_delay_s, and upscale_delay_s. The configuration options you may want to modify are:

  • min_replicas, initial_replicas, max_replicas: Minimum, initial, and maximum number of model replicas to deploy on a Ray cluster.
  • max_concurrent_queries: Maximum queries each Ray Serve replica can handle simultaneously. Excess queries queue at the proxy.
  • target_num_ongoing_requests_per_replica: Guides auto-scaling behavior. Ray Serve scales up the replicas if the average ongoing request count exceeds this number, and scales down if it's lower. This is typically around 40% of max_concurrent_queries.
  • ray_actor_options: Similar to resources_per_worker in the scaling_config section.
  • smoothing_factor: Influences the scaling decision's pace. Values below 1.0 decelerate the scaling process. See advanced auto-scaling guide for more details.
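
Putting these options together, a deployment_config block could look like the following sketch. All numbers are example values rather than recommendations, with target_num_ongoing_requests_per_replica set to roughly 40% of max_concurrent_queries as noted above:

```yaml
# Illustrative deployment_config (example values, not recommendations).
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 24   # ~40% of max_concurrent_queries
  max_concurrent_queries: 64        # excess queries queue at the proxy
  ray_actor_options:                # per-replica actor resources, similar to resources_per_worker
    resources:
      "accelerator_type:A10G": 0.01
```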

Engine configuration

The engine_config section governs interactions with a model, including its scheduling and execution.

  • model_id: Model ID within RayLLM or the OpenAI API.
  • type: Inference engine type; only VLLMEngine supported.
  • engine_kwargs, max_total_tokens: Various configuration options for the inference engine, such as GPU memory utilization, quantization, and max number of concurrent sequences. These options vary based on the hardware accelerator and model size. RayLLM's configuration files offer tuned parameters for reference.
    • enable_lora: Boolean flag to enable multi-LoRA serving (see Serving LoRA Weights below).
  • generation: Contains default parameters for generation, like prompt_format and stopping_sequences.
  • hf_model_id: The Hugging Face model ID; defaults to model_id if unspecified.
  • runtime_env: Ray's runtime environment settings, allowing specific pip packages and environment variables per model. See Ray documentation on Runtime Environments for more information.
  • s3_mirror_config and gcs_mirror_config: Configurations for loading models from S3 or Google Cloud Storage, respectively, to speed up downloads instead of pulling from the Hugging Face Hub.
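
As an illustration, an engine_config block combining these options might resemble the sketch below. The model ID, token counts, and bucket URI are placeholders, and the exact set of accepted engine_kwargs depends on your RayLLM version and hardware:

```yaml
# Illustrative engine_config (placeholder values; prompt_format omitted for brevity).
engine_config:
  model_id: meta-llama/Llama-2-7b-chat-hf
  hf_model_id: meta-llama/Llama-2-7b-chat-hf   # defaults to model_id if unspecified
  type: VLLMEngine
  engine_kwargs:
    max_num_seqs: 64                 # max number of concurrent sequences
    gpu_memory_utilization: 0.9
    enable_lora: false               # set to true for multi-LoRA serving
  max_total_tokens: 4096
  generation:
    stopping_sequences: []           # model-specific; see RayLLM's reference configs
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: "<your-token>"   # needed for gated models
  s3_mirror_config:
    bucket_uri: "s3://<your-bucket>/llama-2-7b-chat-hf/"
```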
tip

RayLLM supports continuous batching, meaning it processes incoming requests as soon as they arrive and adds them to batches that are already being processed. As a result, requests that take longer to generate don't slow down the rest of the batch. RayLLM also supports model quantization, allowing the deployment of compressed models on less resource-intensive hardware. See the quantization guide for more details.

Scaling configuration

The scaling_config section specifies the resources required for serving the model, corresponding to Ray's ScalingConfig. Note that these settings apply to each model replica, not the entire model deployment.

  • num_workers: Number of workers (Ray Actors) per model replica, controlling tensor parallelism.
  • num_gpus_per_worker: Number of GPUs allocated per worker. This should always be 1.
  • num_cpus_per_worker: Number of CPUs per worker. Usually set to 8.
  • placement_strategy: Ray supports different placement strategies to guide the physical distribution of workers. Use "STRICT_PACK" to ensure all workers are on the same node.
  • resources_per_worker: Sets Ray custom resources to assign models to specific node types. For example, to deploy a Llama-2-7b model on an L4 GPU, set accelerator_type:L4 to 0.001. The num_gpus_per_worker configuration, along with the number of GPUs available on the node, determines the number of workers Ray schedules on the node. The supported accelerator types are T4, L4, A10G, A100-40G, and A100-80G.
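
For example, a scaling_config for a replica running on a single L4 GPU could look like this sketch:

```yaml
# Illustrative scaling_config for one replica on a single L4 GPU.
scaling_config:
  num_workers: 1                     # tensor parallelism degree
  num_gpus_per_worker: 1             # always 1
  num_cpus_per_worker: 8
  placement_strategy: "STRICT_PACK"  # keep all workers of a replica on one node
  resources_per_worker:
    "accelerator_type:L4": 0.001     # pin the replica to nodes that advertise L4 GPUs
```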

Model compute configuration

The model_compute_config section specifies the hardware types available for deployment. While you can add more instances, it's important to match the tensor parallelism (num_workers) to the hardware capabilities. For instance, a tensor parallelism of 2 isn't feasible on a machine with only 1 GPU.

note

In advanced configurations, you might notice entries like accelerator_type_a10: 0.01. This setting helps Ray assign and scale the model replica onto the correct hardware. It's a mechanism for specifying physical versus logical resources: a value like 0.01 doesn't represent a fraction of the resource but acts as a flag for Ray's resource scheduler. For a deeper understanding, refer to Ray's documentation on physical versus logical resources.
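
The exact fields of the compute configuration depend on your cloud provider and Anyscale setup; the following is only a hypothetical sketch of a GPU worker entry, with the node-type name and instance type invented for illustration:

```yaml
# Hypothetical sketch of a hardware entry (field names and instance type are
# illustrative; consult your generated config for the exact schema).
worker_node_types:
  - name: gpu-worker-a10
    instance_type: g5.12xlarge   # example AWS instance with 4 A10G GPUs
    min_workers: 0
    max_workers: 2
```

An instance with 4 GPUs, such as the g5.12xlarge shown above, can host a replica with num_workers up to 4, whereas a single-GPU instance can only support num_workers: 1.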

Serving LoRA Weights

Under the engine_config.engine_kwargs block of the Advanced configuration section, you can change enable_lora: false to enable_lora: true to enable multi-LoRA serving, where LoRA weights are fetched dynamically at query time.

Note: Model IDs follow the format {base_model_id}:{suffix}:{id}, such as meta-llama/Llama-2-7b-chat-hf:sql:rhmxn.

You can inspect all your LoRA weights at ${ANYSCALE_ARTIFACT_STORAGE}/lora_fine_tuning, where ANYSCALE_ARTIFACT_STORAGE is an environment variable. In addition, you can continue to query the base model ID, such as meta-llama/Llama-2-7b-chat-hf, while enable_lora is set to true.
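
As a minimal sketch of the change described above, enabling multi-LoRA serving only requires flipping the flag inside the existing engine_kwargs block (all other keys omitted here for brevity):

```yaml
# Minimal change to enable dynamic multi-LoRA serving (other engine_kwargs omitted).
engine_config:
  engine_kwargs:
    enable_lora: true
```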

Debugging deployment issues

Deployment issues may arise from several causes:

  1. An incorrect model ID specification.
  2. Gated Hugging Face models, such as the Llama family of models, require setting the HUGGING_FACE_HUB_TOKEN environment variable cluster-wide, either in the Ray cluster configuration or before running serve run (see the sketch after this list).
  3. Memory shortages, often indicated by "CUDA", "memory", and "NCCL" errors in replica logs or serve run output. Reducing max_batch_prefill_tokens and max_batch_total_tokens might resolve these issues. See example model configurations for valid YAML templates.
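
For the gated-model case, one per-model alternative is to pass the token through the runtime_env option described in the Engine configuration section. A sketch, with a placeholder token value:

```yaml
# Sketch: passing the Hugging Face token to a model via runtime_env
# (the token value below is a placeholder).
engine_config:
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: "<your-hf-token>"
```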

For broader debugging, the Ray Dashboard serves as a tool for monitoring applications and accessing Ray logs.