Multi-LoRA Support
RayLLM offers multi-LoRA support, which enables running many LoRA fine-tuned adapters on a single base-model deployment. Multi-LoRA improves resource utilization and cuts cost. This guide will help you understand:
- What multi-LoRA support is
- Why multi-LoRA support helps save cost
- How to configure multi-LoRA support
Background
Suppose you run a movie-recommendation system using LLMs. The system may have LoRA fine-tuned weights for each user so it can personalize responses for that user. Without multi-LoRA support, the system must run one base model per LoRA adapter, so the number of GPUs (and the cost!) scales with the number of users.
Multi-LoRA lets multiple LoRA adapters share a single base model. When a user queries a particular adapter, the adapter is loaded onto the model and cached with a least-recently-used policy. Furthermore, the system can batch requests from different users with different LoRA adapters and decode them in a single forward pass. This significantly reduces the number of GPUs needed to run multiple LoRA adapters. Getting this right is tricky: it requires careful scheduling, memory management, and scaling. Fortunately, RayLLM provides multi-LoRA support out of the box.
Setting up weights
RayLLM pulls LoRA adapter weights from a cloud storage bucket. Store your LoRA adapter files under the following path:
[lora_dynamic_path]/[base_model_id]:[lora_adapter_suffix]
The lora_dynamic_path can be any path in your cloud storage. The base_model_id should match the model_id in your model config YAML file. The lora_adapter_suffix can be any string and acts as an ID for this particular set of LoRA adapter weights. The client uses this suffix to specify the LoRA weights for a request.
An example path is:
s3://my_lora_bucket/my/path/meta-llama/Meta-Llama-3.1-70B-Instruct:my_suffix
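If your adapter checkpoint is on local disk after fine-tuning, copy its files into this layout. The snippet below is a minimal sketch using boto3; the local checkpoint directory, bucket, prefix, and suffix are hypothetical placeholders, and it assumes your cloud credentials are already configured.
# Minimal sketch: upload a local LoRA adapter checkpoint into the dynamic LoRA path.
# All names below are hypothetical placeholders.
import os
import boto3

LOCAL_CHECKPOINT_DIR = "./my_lora_checkpoint"  # hypothetical local adapter directory
BUCKET = "my_lora_bucket"
PREFIX = "my/path/meta-llama/Meta-Llama-3.1-70B-Instruct:my_suffix"

s3 = boto3.client("s3")
for root, _, files in os.walk(LOCAL_CHECKPOINT_DIR):
    for name in files:
        local_file = os.path.join(root, name)
        # Preserve the checkpoint's internal layout under the adapter prefix.
        key = f"{PREFIX}/{os.path.relpath(local_file, LOCAL_CHECKPOINT_DIR)}"
        s3.upload_file(local_file, BUCKET, key)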
Enabling multi-LoRA support
To enable multi-LoRA support, configure the lora_config in your model's llm_config. Set the dynamic_lora_loading_path and the max_num_adapters_per_replica. For example:
model_loading_config:
  model_id: meta-llama/Meta-Llama-3.1-70B-Instruct
  ...
...
lora_config:
  dynamic_lora_loading_path: s3://my_lora_bucket/my/path
  max_num_adapters_per_replica: 16
...
- dynamic_lora_loading_path: The path that contains all your LoRA adapters. This path must be the same across all the LoRA fine-tuned models running on a single cluster.
- max_num_adapters_per_replica: The maximum number of LoRA adapters that can share a single base model.
Note that you cannot use multi-LoRA and JSON mode on the same model.
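As a sanity check, you can list the adapter suffixes that a deployment sees under its dynamic_lora_loading_path. This is a minimal sketch using boto3 with the example bucket and prefix from the config above; adjust the names for your setup.
# Minimal sketch: list the LoRA adapter suffixes available for the base model
# configured above. Bucket and prefix mirror the example config.
import boto3

BUCKET = "my_lora_bucket"
BASE_PREFIX = "my/path/meta-llama/Meta-Llama-3.1-70B-Instruct:"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
suffixes = set()
for page in paginator.paginate(Bucket=BUCKET, Prefix=BASE_PREFIX, Delimiter="/"):
    for common_prefix in page.get("CommonPrefixes", []):
        # Each common prefix looks like "my/path/<base_model_id>:<suffix>/".
        suffixes.add(common_prefix["Prefix"][len(BASE_PREFIX):].rstrip("/"))
print(sorted(suffixes))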
Querying multi-LoRA models
To query multi-LoRA models, append the LoRA adapter suffix to the base model ID:
% curl $ENDPOINT_URL/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct:my_suffix",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]
  }'
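The endpoint follows the OpenAI chat-completions format, so you can also target a specific adapter from the openai Python SDK. This is a minimal sketch; the ENDPOINT_URL and API_KEY environment variables are placeholders for your deployment's URL and credentials.
# Minimal sketch: query a LoRA adapter through the OpenAI-compatible API.
# ENDPOINT_URL and API_KEY are placeholders for your deployment's values.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["ENDPOINT_URL"],
    api_key=os.environ["API_KEY"],
)

response = client.chat.completions.create(
    # Base model ID plus the LoRA adapter suffix selects the adapter.
    model="meta-llama/Meta-Llama-3.1-70B-Instruct:my_suffix",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)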
Example use case: serverless fine-tuning platform
Multi-LoRA allows you to offer a serverless inference platform for fine-tuned weights. First, set up a dynamic LoRA path where you can push fine-tuned weights. If you're using LLMForge on Anyscale, this path is set up automatically, and all fine-tuned weights are pushed to it. Next, start a model with RayLLM that reads from that dynamic path. Now you can fine-tune the base model and push the fine-tuned adapter weights to the same dynamic LoRA path. The new adapter weights are immediately available to query, without any additional deployments or changes to the model configuration.
This powerful design pattern enables your team's data scientists and ML engineers to deploy and query fine-tuned weights without worrying about the underlying serving infrastructure.