RayLLM Deployment Options
This guide covers the different methods of deploying RayLLM on Anyscale, including:
- Deploying in a workspace with the Ray Serve CLI
- Deploying a production service from a workspace using the Anyscale CLI / SDK
- Deploying a production service from any machine outside of a workspace using the Anyscale CLI / SDK
Deploy within a workspace
RayLLM apps deployed with this method cannot be queried from outside of the workspace, including from the Playground. This method is for quick iteration and testing only.
This method lets you run and query RayLLM on a workspace. Use this method for development, testing, and quick iteration on your model with different configurations. You must have access to the Deploy LLM workspace template.
This template creates a workspace with the latest version of RayLLM installed. Use the rayllm gen-config interactive CLI to generate the config files for your desired models. You can also manually create the config files from scratch by following the RayLLM API and Services documentation.
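For example, you can run the interactive generator from a terminal in the workspace; it walks you through model and hardware choices and writes the Serve config files into your working directory. The exact prompts and output file names depend on your RayLLM version.

```
# Generate a Serve config interactively; answer the prompts to pick a model,
# accelerator type, and other settings.
rayllm gen-config

# Inspect the generated Serve config before deploying (the file name may differ).
cat config.yaml
```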
Once you have the Serve config file, you can start the service by running the following command:
serve run config.yaml [--non-blocking]
Example config file for Llama-3.1-8B
# File name: config.yaml
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: meta-llama/Meta-Llama-3.1-8B-Instruct
        model_source: meta-llama/Meta-Llama-3.1-8B-Instruct
      runtime_env:
        env_vars:
          HUGGING_FACE_HUB_TOKEN: insert_your_hf_token_here
      generation_config:
        prompt_format:
          assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
          bos: <|begin_of_text|>
          default_system_message: ''
          system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
          system_in_user: false
          trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
          user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
        stopping_sequences:
        - <|end_of_text|>
        - <|eot_id|>
      input_modality: text
      llm_engine: VLLMEngine
      engine_kwargs:
        enable_chunked_prefill: true
        max_num_batched_tokens: 2048
        max_num_seqs: 64
        tokenizer_pool_extra_config:
          runtime_env:
            pip: null
        tokenizer_pool_size: 2
        trust_remote_code: true
      json_mode:
        enabled: false
      lora_config: null
      max_request_context_length: 8192
      accelerator_type: A10G
      tensor_parallelism:
        degree: 1
      deployment_config:
        autoscaling_config:
          target_ongoing_requests: 32
        max_ongoing_requests: 64
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: false
Check your service's health by looking at the Ray dashboard's Serve metrics page.
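Once the application is running, you can query it from a terminal inside the workspace. The sketch below assumes the default Ray Serve HTTP port (8000) and RayLLM's OpenAI-compatible chat completions route; adjust the port, path, and model ID to match your config.

```
# Check application status from the CLI as an alternative to the dashboard.
serve status

# Send a test request to the locally running RayLLM app.
# Assumes the default Serve port (8000) and the OpenAI-compatible API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is Ray Serve?"}]
      }'
```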
Deploy a production service from a workspace
This method lets you use a workspace to launch a long-running, externally available service that can be queried from any machine. Use this method after you finish tuning the RayLLM config generated by the rayllm gen-config CLI command in the Deploy LLM workspace template. That way, you can launch your LLM as a production service without worrying about the cluster environment, resource specification, or compute configuration.
To deploy a service from a workspace, use the following command:
anyscale service deploy -f config.yaml
This command creates a long-running Anyscale Service using the Serve config and the workspace image. You can query the resulting production service from any machine.
If you deploy the service without token authentication, Anyscale automatically makes it available in the Playground, where you can try it out. If token authentication is enabled, you must add the service to the Playground manually.
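Once the service is live, you can query it from any machine. The example below is a sketch: the base URL and bearer token are placeholders, which you can find on the service's page in the Anyscale console (the token header is only needed if query authentication is enabled).

```
# Placeholders: replace with the URL and token shown on your service's page.
export SERVICE_URL="https://your-service-url.example.com"
export SERVICE_TOKEN="your_service_query_token"

curl "$SERVICE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $SERVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```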
Deploy a production service from any machine
This method lets you launch a long-running, externally available service from any machine, not just an Anyscale workspace. Use this method to launch RayLLM from your laptop, deploy RayLLM through CI/CD pipelines, and version-control your RayLLM configs.
The machine deploying RayLLM must have an Anyscale token and the Anyscale CLI installed. Take the same config file used when deploying from a workspace, and add the RayLLM image_uri that the Anyscale Service should use. You can find the list of RayLLM images here. Start the service using the anyscale service deploy command.
anyscale service deploy -n your_service_name -f config.yaml
Example changes to a config YAML file to deploy from any machine:
# File name: config.yaml
# applications are the same as above.
applications: ...
# Now image_uri is required. In the workspace, this came from the workspace image.
image_uri: localhost:5555/anyscale/endpoints_aica:1.0.0-7352
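As a sketch, setting up a fresh machine or a CI/CD runner to run the deploy command might look like the following. The CLI installs from PyPI; the ANYSCALE_CLI_TOKEN environment variable is assumed here for non-interactive authentication, so check the Anyscale CLI documentation for the authentication method that fits your setup.

```
# Install the Anyscale CLI.
pip install anyscale

# Authenticate non-interactively, for example in a CI/CD pipeline.
# ANYSCALE_CLI_TOKEN is assumed here; see the CLI docs for other auth options.
export ANYSCALE_CLI_TOKEN="your_anyscale_token"

# Deploy the version-controlled config, which includes the image_uri field above.
anyscale service deploy -n your_service_name -f config.yaml
```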