Note: RayLLM APIs are in Beta.

Serving Open-weight LLMs on Anyscale

With RayLLM, you can run open-weight LLMs on Anyscale. Compared to using closed-weight LLMs, serving open-weight LLMs yourself offers several advantages:

  1. More control: You can optimize for the best cost-quality trade-off depending on your use case.
  2. Better alignment: You can fine-tune the weights to better match your use case and self-deploy the aligned models.
  3. No lock-in: You can export, save, and migrate your fine-tuned weights across different platforms and integrate them with your existing infrastructure.

What is RayLLM?

RayLLM is an LLM serving system for open-weight LLMs. It runs LLM inference engines such as vLLM and provides out-of-the-box autoscaling, observability, fault-tolerance, and more.

RayLLM is built on top of Ray Serve, a highly scalable and efficient ML serving system. RayLLM runs the LLM inference engine as a Ray Serve deployment, so it can leverage Ray Serve to scale, schedule, and health-check the inference engine.
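
As an illustration of what this means in practice, the following sketch uses the Ray Serve API directly: a deployment class wraps an inference engine, and Serve handles replica autoscaling and health checks. This is not RayLLM's internal code; the InferenceEngine class and everything inside it are hypothetical.

# Illustrative Ray Serve deployment, not RayLLM's actual implementation.
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,            # scale-to-zero when the service is idle
        "max_replicas": 4,
        "target_ongoing_requests": 32,
    },
    health_check_period_s=10,         # Serve calls check_health() periodically
)
class InferenceEngine:
    def __init__(self):
        # RayLLM would initialize an engine such as vLLM here.
        self.healthy = True

    def check_health(self):
        # Raising here marks the replica unhealthy so Serve can restart it.
        if not self.healthy:
            raise RuntimeError("inference engine is unhealthy")

    async def __call__(self, request) -> str:
        # RayLLM would run tokenization and generation here.
        return "generated text"

app = InferenceEngine.bind()
# serve.run(app)  # serves the deployment, e.g. at http://localhost:8000/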

RayLLM supports a number of important features, including:

  • Autoscaling: RayLLM provides request-based autoscaling out of the box, which starts new model replicas and compute nodes based on the number of ongoing and pending requests. RayLLM also offers scale-to-zero, which lets your cluster scale down to 0 GPUs when not in use.
  • Multi-LoRA inference: RayLLM uses efficient scheduling methods and custom GPU kernels to achieve low-latency and high-throughput inference on fine-tuned LoRA adapters.
  • Multi-model services: RayLLM can run multiple LLMs on a single cluster, which improves resource utilization and simplifies model management.
  • JSON mode: RayLLM can produce JSON-formatted responses, with support for constrained schemas. This is useful for integrating LLMs with systems that need structured output, such as tool calling.
  • OpenAI API: RayLLM provides an OpenAI-compatible REST API for easy integration with other LLM development tools; see the query sketch after this list. See this guide for migrating from OpenAI.
  • Bring any custom models: RayLLM provides out-of-the-box defaults for popular models like Llama and Mistral, and lets you bring any model supported by the vLLM engine.
  • Observability: RayLLM provides a dashboard, metrics, and logs so you can monitor your model's health and usage.
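
For example, because the API is OpenAI-compatible, a running RayLLM endpoint can be queried with the official openai Python client (v1+). This is a minimal sketch: the base_url, the placeholder API key, and the OpenAI-style response_format field used for JSON mode are assumptions for illustration; check the RayLLM API documentation for the exact request schema.

from openai import OpenAI

# Point the client at the RayLLM endpoint instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local RayLLM route
    api_key="FAKE_KEY",                   # placeholder; a local deployment may not check it
)

# Plain chat completion through the OpenAI-compatible API.
chat = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(chat.choices[0].message.content)

# JSON mode (assumes json_mode.enabled: true in the RayLLM config and an
# OpenAI-style response_format field).
structured = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three colors as a JSON object."}],
    response_format={"type": "json_object"},
)
print(structured.choices[0].message.content)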

How do you run RayLLM?

You can start from this workspace template which has all the dependencies installed. Follow the template instructions to get your service up and running.

The workspace helps you generate RayLLM config files that specify the model to run as well as settings like the maximum context length and prompt format. See the API documentation here for a detailed explanation of the available configuration options.

Example config file for Llama-3.1-8B
# File name: config.yaml

applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
        model_source: meta-llama/Meta-Llama-3.1-8B-Instruct
      runtime_env:
        env_vars:
          HUGGING_FACE_HUB_TOKEN: insert_your_hf_token_here
      generation_config:
        prompt_format:
          assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
          bos: <|begin_of_text|>
          default_system_message: ''
          system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
          system_in_user: false
          trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
          user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
        stopping_sequences:
        - <|end_of_text|>
        - <|eot_id|>
      input_modality: text
      llm_engine: VLLMEngine
      engine_kwargs:
        enable_chunked_prefill: true
        max_num_batched_tokens: 2048
        max_num_seqs: 64
        tokenizer_pool_extra_config:
          runtime_env:
            pip: null
        tokenizer_pool_size: 2
        trust_remote_code: true
      json_mode:
        enabled: false
      lora_config: null
      max_request_context_length: 8192
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          target_ongoing_requests: 32
        max_ongoing_requests: 64
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: false

On the workspace, you can run the service locally using serve run config.yaml. Once the service is healthy, you can verify that the model works by querying it:

% curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
          "temperature": 0.7
        }'
{... "content": "Hello, I am here to help you. What would you like me to do today?" ...}

You can also run RayLLM as a long-running production Anyscale Service. From an Anyscale workspace, run:

anyscale service deploy -f config.yaml

This starts a service that runs the RayLLM application using the same image and compute configuration as the workspace. For instructions on deploying without an Anyscale workspace, see here.

The service automatically provisions a load balancer and an authentication token, which you can use to query it. You can also query the service in the Playground and compare multiple models side-by-side.
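
As a sketch, you can also query the deployed service with the openai Python client by passing the service's base URL and token. The values below are placeholders to copy from the service's query instructions, and the client is assumed to send the token as a standard bearer Authorization header.

from openai import OpenAI

# Placeholders: copy the real values from the Anyscale service's query instructions.
SERVICE_BASE_URL = "https://insert-your-service-url-here"
SERVICE_TOKEN = "insert_your_service_token_here"

# The openai client sends api_key as "Authorization: Bearer <token>".
client = OpenAI(base_url=f"{SERVICE_BASE_URL}/v1", api_key=SERVICE_TOKEN)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)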