RayLLM APIs are in Beta.
Serving Open-weight LLMs on Anyscale
With RayLLM, you can run open-weight LLMs on Anyscale. Compared to using closed-weight LLMs, serving open-weight LLMs yourself offers several advantages:
- More control: You can optimize for the best cost-quality trade-off depending on your use case.
- Better alignment: You can fine-tune the weights to better match your use case and self-deploy the aligned models.
- No lock-in: You can export, save, and migrate your fine-tuned weights across different platforms and integrate them with your existing infrastructure.
What is RayLLM?
RayLLM is an LLM serving system for open-weight LLMs. It runs LLM inference engines such as vLLM and provides out-of-the-box autoscaling, observability, fault-tolerance, and more.
RayLLM is built on top of Ray Serve, a highly scalable and efficient ML serving system. RayLLM runs the LLM inference engine as a Ray Serve deployment, so it can leverage Ray Serve to scale, schedule, and health-check the inference engine.
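For intuition, here is a minimal sketch of this pattern using the public Ray Serve API. It is not RayLLM's actual implementation: the class name, autoscaling values, and echo logic are placeholders, and a real replica would start and query a vLLM engine instead.
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},   # each replica reserves one GPU
    autoscaling_config={
        "min_replicas": 0,                # allow scale-to-zero when idle
        "max_replicas": 4,
        "target_ongoing_requests": 32,    # add replicas as load grows
    },
)
class LLMDeployment:
    def __init__(self, model_id: str):
        # RayLLM would start a vLLM engine here; this sketch only stores the id.
        self.model_id = model_id

    async def __call__(self, prompt: str) -> str:
        # A real replica would run the inference engine on the prompt.
        return f"[{self.model_id}] echo: {prompt}"

# Bind the deployment; Ray Serve then manages replica lifecycle and health checks.
app = LLMDeployment.bind("meta-llama/Meta-Llama-3.1-8B-Instruct")
# serve.run(app)  # start it locally, for example inside a workspace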
RayLLM supports a number of important features, including:
- Autoscaling: RayLLM provides request-based autoscaling out of the box, which starts new model replicas and compute nodes based on the number of ongoing and pending requests. RayLLM also offers scale-to-zero, which lets your cluster scale down to 0 GPUs when not in use.
- Multi-LoRA inference: RayLLM uses efficient scheduling methods and custom GPU kernels to achieve low-latency and high-throughput inference on fine-tuned LoRA adapters.
- Multi-model services: RayLLM can run multiple LLMs on a single cluster, which improves resource utilization and simplifies model management.
- JSON mode: RayLLM can produce JSON-formatted responses, with support for constrained schemas. This is useful for integrating LLMs with systems that need structured outputs, such as tool calling; see the sketch after this list.
- OpenAI API: RayLLM provides an OpenAI-compatible REST API for easy integration with other LLM development tools. See this guide for migrating from OpenAI.
- Bring any custom model: RayLLM provides out-of-the-box defaults for popular models like Llama and Mistral, along with the ability to bring any model supported by the vLLM engine.
- Observability: RayLLM provides dashboards, metrics, and logs, so you can monitor your model's health and usage.
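For example, because the API is OpenAI-compatible, a JSON-mode request can be made with the standard OpenAI client. This is a hedged sketch: it assumes a RayLLM service running locally with json_mode enabled in its config (the example config below leaves it disabled), and the exact options for attaching a constrained schema depend on your RayLLM version.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NONE")

# Request a JSON-formatted answer through the OpenAI-style response_format field.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Reply only with JSON."},
        {"role": "user", "content": "List three colors with their hex codes."},
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)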
How do you run RayLLM?
You can start from this workspace template, which has all the dependencies installed. Follow the template instructions to get your service up and running.
The workspace helps you generate RayLLM config files that specify the model to run as well as settings such as the maximum context length and the prompt format. See the API documentation here for a detailed explanation of the available configuration options.
Example config file for Llama-3.1-8B
# File name: config.yaml
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
        model_source: meta-llama/Meta-Llama-3.1-8B-Instruct
      runtime_env:
        env_vars:
          HUGGING_FACE_HUB_TOKEN: insert_your_hf_token_here
      generation_config:
        prompt_format:
          assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
          bos: <|begin_of_text|>
          default_system_message: ''
          system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
          system_in_user: false
          trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
          user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
        stopping_sequences:
        - <|end_of_text|>
        - <|eot_id|>
      input_modality: text
      llm_engine: VLLMEngine
      engine_kwargs:
        enable_chunked_prefill: true
        max_num_batched_tokens: 2048
        max_num_seqs: 64
        tokenizer_pool_extra_config:
          runtime_env:
            pip: null
        tokenizer_pool_size: 2
        trust_remote_code: true
      json_mode:
        enabled: false
      lora_config: null
      max_request_context_length: 8192
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          target_ongoing_requests: 32
        max_ongoing_requests: 64
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: false
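A note on engine_kwargs in the config above: RayLLM forwards them to the vLLM engine, so they correspond to vLLM's own engine arguments. As a rough illustration (running vLLM directly on a GPU machine, outside RayLLM), the same settings look like this:
from vllm import LLM  # assumes vllm is installed and a GPU is available

# The keyword arguments mirror the engine_kwargs section of config.yaml.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    max_num_seqs=64,
    trust_remote_code=True,
)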
On the workspace, you can run the service locally using serve run config.yaml. Once the service is healthy, you can verify that the model works by querying it:
- curl
% curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
"temperature": 0.7
}'
- OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="NONE",
)

chat_completions = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    stream=True,
)

for chat in chat_completions:
    if chat.choices[0].delta.content is not None:
        print(chat.choices[0].delta.content, end="")
{... "content":"Hello, I am here to help you. What would you like me to do today?" ...}
You can also run RayLLM as a long-running production Anyscale Service. From an Anyscale workspace, you can run
anyscale service deploy -f config.yaml
This starts a service that runs the RayLLM application using the same image and compute configuration as the workspace. For instructions on deploying without an Anyscale workspace, see here.
The service automatically provisions a load balancer and authentication token, which you can use to query the service. You can also query a service in the Playground and compare multiple models side-by-side.
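As an illustration, you can point the same OpenAI client at the deployed service by using the service's base URL and passing its authentication token as the API key. The URL and token below are placeholders; use the values printed by anyscale service deploy or shown on the service page.
from openai import OpenAI

# Placeholder values: substitute the base URL and bearer token of your service.
SERVICE_BASE_URL = "https://llm-endpoint-example.anyscaleuserdata.com"
SERVICE_TOKEN = "your_service_token"

client = OpenAI(base_url=f"{SERVICE_BASE_URL}/v1", api_key=SERVICE_TOKEN)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)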