RayLLM Deployment Options

This guide covers the different methods of deploying RayLLM on Anyscale: deploying within a workspace, deploying a production service from a workspace, and deploying a production service from any machine.

Deploy within a workspace

note

RayLLM apps deployed with this method can't be queried from outside the workspace, including from the Playground. Use this method only for quick iteration and testing.

This method lets you run and query RayLLM on a workspace. Use this method for development, testing, and quick iteration on your model with different configurations. You must have access to the Deploy LLM workspace template.

This template creates a workspace with the latest version of RayLLM installed. Use the rayllm gen-config interactive CLI to generate the config files for your desired models. You can also manually create the config files from scratch by following the RayLLM API and Services documentation.
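For example, you can run the interactive generator from a terminal in the workspace. It prompts you for the model and deployment options and writes the resulting Serve config file to the working directory:

rayllm gen-config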

Once you have the Serve config file, you can start the service by running the following command:

serve run config.yaml [--non-blocking]

Example config file for Llama-3.1-8B:
# File name: config.yaml

applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
        model_source: meta-llama/Meta-Llama-3.1-8B-Instruct
      runtime_env:
        env_vars:
          HUGGING_FACE_HUB_TOKEN: insert_your_hf_token_here
      generation_config:
        prompt_format:
          assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
          bos: <|begin_of_text|>
          default_system_message: ''
          system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
          system_in_user: false
          trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
          user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
        stopping_sequences:
        - <|end_of_text|>
        - <|eot_id|>
      input_modality: text
      llm_engine: VLLMEngine
      engine_kwargs:
        enable_chunked_prefill: true
        max_num_batched_tokens: 2048
        max_num_seqs: 64
        tokenizer_pool_extra_config:
          runtime_env:
            pip: null
        tokenizer_pool_size: 2
        trust_remote_code: true
      json_mode:
        enabled: false
      lora_config: null
      max_request_context_length: 8192
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          target_ongoing_requests: 32
        max_ongoing_requests: 64
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: false

Check your service's health by looking at the Ray dashboard's Serve metrics page.
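To confirm the endpoint responds, you can also send a test request from inside the workspace. The following is a minimal sketch that assumes the default Ray Serve HTTP port of 8000 and RayLLM's OpenAI-compatible chat completions route; adjust the port, route, and model ID if your config differs.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'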

Deploy a production service from a workspace

This method lets you use a workspace to launch a long-running, externally available service that can be queried from any machine. Use this method after you finish tuning the RayLLM config generated by the rayllm gen-config CLI command in the Deploy LLM workspace template. That way, you can launch your LLM as a production service without worrying about the cluster environment, resource specification, or compute configuration.

To deploy a service from a workspace, use the following command:

anyscale service deploy -f config.yaml

This command creates a long-running Anyscale Service using the Serve config and the workspace image. You can query the resulting production service from any machine.
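For example, you can send a request to the production service from any machine. This sketch uses placeholder values for the service URL and bearer token; find the real values on the service's page in the Anyscale console. If you deployed without token authentication, you can omit the Authorization header.

curl https://<your-service-url>/v1/chat/completions \
  -H "Authorization: Bearer <your-service-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'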

note

If you deploy the service without token authentication, Anyscale automatically makes it available in the Playground, and you can try out the service from there. If token authentication is enabled, you must manually add the service to the Playground.

Deploy a production service from any machine

This method lets you launch a long-running, externally available service from any machine, not just an Anyscale workspace. Use this method to launch RayLLM from your laptop, deploy RayLLM through CI/CD pipelines, and version-control your RayLLM configs.

The machine deploying RayLLM must have an Anyscale token and the Anyscale CLI installed. Take the same config file used when deploying from a workspace, and add the RayLLM image_uri that the Anyscale Service should use. You can find the list of RayLLM images here. Start the service using the anyscale service deploy command.

anyscale service deploy -n your_service_name -f config.yaml

Example changes to a config YAML file to deploy from any machine:

# File name: config.yaml

# applications are the same as above.
applications: ...
# Now image_uri is required. In the workspace, this came from the workspace image.
image_uri: localhost:5555/anyscale/endpoints_aica:1.0.0-7352
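In a CI/CD pipeline, the deploy step might look like the following sketch. The secret variable name is illustrative, and it assumes the Anyscale CLI reads its authentication token from the ANYSCALE_CLI_TOKEN environment variable; substitute however your pipeline manages secrets.

# Illustrative CI step: install the Anyscale CLI, authenticate with a token
# stored as a CI secret, and deploy the version-controlled RayLLM config.
pip install anyscale
export ANYSCALE_CLI_TOKEN="$CI_ANYSCALE_TOKEN"   # hypothetical secret variable
anyscale service deploy -n your_service_name -f config.yaml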