
Migration guide: Transitioning from RayLLM to Ray Serve LLM API

This guide explains how to migrate your LLM deployment from the legacy RayLLM to the new Ray Serve LLM API (ray[serve,llm] >= 2.44.0).

The overall deployment process remains the same; the primary change is that you need to update the LLM Model Config and Serve Config files to work with the new Ray Serve LLM API.

Migration summary table for LLMConfig

LLMConfig is the central configuration object that the LLM Model Config YAML file maps onto and that drives the deployment. LLMConfig changes substantially between RayLLM and the new Ray Serve LLM API.

The migration summary table below maps the common LLMConfig fields from RayLLM to the new Ray Serve LLM API and includes additional details for customizations you may have added.

Tip: Start by running the migration script provided below to convert your old LLM Model Config file to the new format. Then consult the migration summary table for additional details on specific configuration fields.


  • accelerator_type
    • Description: The type of accelerator used to run the model.
    • Migration action: Value updated.
    • Migration details: The new Ray Serve LLM API supports additional accelerator types; see the full list at https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html. Accelerator name updates: A10 changes to A10G, A100_40G changes to A100-40G, and A100_80G changes to A100-80G.
  • deployment_config
    • Description: The Ray Serve deployment settings for the model deployment.
    • Migration action: No change.
  • engine_kwargs
    • Description: Additional keyword arguments for the engine.
    • Migration action: New fields added.
    • Migration details: engine_kwargs now supports the tensor_parallel_size and max_model_len fields.
    • Additional notes: For the full list of engine_kwargs, see https://docs.vllm.ai/en/latest/serving/engine_args.html.
  • generation_config
    • Description: Settings for prompt formatting and token interpretation.
    • Migration action: Removed.
    • Migration details: generation_config and prompt_format are no longer used. The new Ray Serve LLM API imports configurations such as prompt_format directly from Hugging Face. Models from the Hugging Face ecosystem use the chat templating pattern outlined at https://huggingface.co/docs/transformers/main/en/chat_templating.
    • Additional notes: If you previously used stopping_sequences or stopping_tokens, use the vLLM sampling parameter stop_token_ids, which defines tokens that halt generation when they appear (see the vLLM Sampling Params documentation). You can pass extra_body={"stop_token_ids": stop_token_ids} in the OpenAI chat client to use these sampling parameters with vLLM; see the client sketch after this table.
  • input_modality
    • Description: Defines the type of request that the model accepts.
    • Migration action: Removed.
    • Migration details: Input modality is now derived automatically from the model.
  • json_mode
    • Description: Configuration for JSON mode.
    • Migration action: Removed.
    • Migration details: JSON mode is enabled automatically in the new Ray Serve LLM deployment.
    • Additional notes: Set response_format={"type": "json_object"} in the OpenAI chat client. You can also pass extra_body={"guided_json": json_schema} to supply a JSON schema; see the client sketch after this table. For the structured outputs that vLLM supports, see https://docs.vllm.ai/en/latest/features/structured_outputs.html.
  • llm_engine
    • Description: The LLM engine used to run the model.
    • Migration action: Kept, but restricted to vLLM.
    • Migration details: This field is optional; if provided, only the value vLLM is accepted.
  • lora_config
    • Description: Settings for LoRA adapters.
    • Migration action: No change to the configuration itself.
    • Migration details: LoRA weight storage path naming is simplified while remaining compatible with previous paths. Previously, RayLLM stored LoRA weights in paths that combined the base model ID with a custom suffix (for example, s3://dynamic_path/[base_model]:[suffix]). Now each LoRA weight has a unique ID directly under a common directory (for example, s3://dynamic_path/[lora_id]). This removes the dependency on the base model name, making storage more flexible and easier to manage.
    • Additional notes: See the multi-LoRA deployment example for the new LLM API: https://docs.ray.io/en/latest/serve/llm/serving-llms.html#multi-lora-deployment
  • max_request_context_length
    • Description: Maximum tokens (input + generated) per request. Must be ≤ the model context length.
    • Migration action: Refactored.
    • Migration details: Use the max_model_len field in engine_kwargs instead.
    • Additional notes: Ray Serve LLM passes max_model_len in engine_kwargs directly to the vLLM engine; see https://docs.vllm.ai/en/latest/serving/engine_args.html.
  • model_loading_config
    • Description: Configuration for downloading and exposing the model.
    • Migration action: No change.
  • runtime_env
    • Description: Runtime environment settings for deployment replicas and engine workers.
    • Migration action: Environment variable renamed.
    • Migration details: Replace HUGGING_FACE_HUB_TOKEN with HF_TOKEN.
    • Additional notes: As of Ray 2.43.0, both HUGGING_FACE_HUB_TOKEN and HF_TOKEN are supported, but HUGGING_FACE_HUB_TOKEN will be deprecated soon.
  • tensor_parallelism
    • Description: Tensor parallelism settings for the model.
    • Migration action: Refactored.
    • Migration details: Move tensor_parallelism.degree to the tensor_parallel_size field in engine_kwargs.
    • Additional notes: Ray Serve LLM passes tensor_parallel_size in engine_kwargs directly to the vLLM engine; see https://docs.vllm.ai/en/latest/serving/engine_args.html.
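
For the generation_config and json_mode rows above, here is a minimal client-side sketch using the OpenAI Python client. The base URL, API key, stop token ID, and JSON schema are placeholders for illustration; substitute the values for your own deployment and model.

from openai import OpenAI

# Placeholder endpoint and token; replace with your deployed service's URL and auth token.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

# Stopping tokens: pass vLLM's stop_token_ids sampling parameter through extra_body.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "List three facts about llamas."}],
    extra_body={"stop_token_ids": [151645]},  # Placeholder IDs; use your model's stop tokens.
)
print(response.choices[0].message.content)

# JSON mode: request a JSON object and, optionally, constrain it with a guided_json schema.
json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    response_format={"type": "json_object"},
    extra_body={"guided_json": json_schema},
)
print(response.choices[0].message.content)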

Migration script for LLM model config

The following Python script automates most of the migration process. It reads your old LLM Model Config YAML file, applies the necessary changes, and writes out the new configuration file: new_llm_config.yaml.

import yaml
from pprint import pprint

config_path = "old_llm_config.yaml"  # Replace with your old LLM config file.

with open(config_path, "r") as f:
    config_dict = yaml.safe_load(f)

# Define the whitelist of supported parameters for LLMConfig.
allowed_keys = {
    "accelerator_type",
    "deployment_config",
    "engine_kwargs",
    "llm_engine",
    "lora_config",
    "model_loading_config",
    "runtime_env",
}

# Filter the loaded config to only include allowed keys.
new_config = {key: value for key, value in config_dict.items() if key in allowed_keys}

# Update runtime_env to rename HUGGING_FACE_HUB_TOKEN to HF_TOKEN.
if "runtime_env" in new_config:
    env_vars = new_config["runtime_env"].get("env_vars", {})
    if "HUGGING_FACE_HUB_TOKEN" in env_vars:
        env_vars["HF_TOKEN"] = env_vars.pop("HUGGING_FACE_HUB_TOKEN")
    new_config["runtime_env"]["env_vars"] = env_vars

# Update accelerator_type if necessary using a mapping.
accelerator_mapping = {
    "A10": "A10G",
    "A100_40G": "A100-40G",
    "A100_80G": "A100-80G",
}
if "accelerator_type" in new_config:
    current = new_config["accelerator_type"]
    if current in accelerator_mapping:
        new_config["accelerator_type"] = accelerator_mapping[current]

# Refactor max_request_context_length to max_model_len inside engine_kwargs.
if "max_request_context_length" in config_dict:
    # Ensure engine_kwargs exists.
    if "engine_kwargs" not in new_config or new_config["engine_kwargs"] is None:
        new_config["engine_kwargs"] = {}
    new_config["engine_kwargs"]["max_model_len"] = config_dict["max_request_context_length"]

# If tensor_parallelism is specified in the YAML,
# add its "degree" value to engine_kwargs as tensor_parallel_size.
if "tensor_parallelism" in config_dict:
    degree = config_dict["tensor_parallelism"].get("degree", 1)
    # Ensure engine_kwargs exists.
    if "engine_kwargs" not in new_config or new_config["engine_kwargs"] is None:
        new_config["engine_kwargs"] = {}
    new_config["engine_kwargs"]["tensor_parallel_size"] = degree

# Force llm_engine to "vLLM" (the only supported engine) regardless of the YAML content.
new_config["llm_engine"] = "vLLM"

print("The new LLM configuration is:")
pprint(new_config)

# Save the new configuration to a YAML file.
output_path = "new_llm_config.yaml"
with open(output_path, "w") as outfile:
    yaml.safe_dump(new_config, outfile)

print(f"New configuration saved to {output_path}")

Step-by-step migration guide


Step 1: Verify prerequisites and install necessary packages

Before migrating, ensure your environment meets the requirements for the new Ray Serve LLM API (ray[serve,llm] >= 2.44.0).

Alternatively, you can use a Docker image such as anyscale/ray-llm:2.44.1-py311-cu124, which has all the necessary packages preinstalled. Check Anyscale for the newest ray-llm Docker image.

For more details, see:

https://docs.ray.io/en/latest/serve/llm/serving-llms.html
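
A quick way to confirm the environment meets this requirement is a short Python check. This is a minimal sketch; it assumes you run it inside the environment (or Docker image) you plan to deploy from.

import ray
from ray.serve.llm import LLMConfig  # This import fails if the llm extra is missing.

# The new API ships with ray[serve,llm] >= 2.44.0.
print("Ray version:", ray.__version__)
print("LLMConfig available:", LLMConfig is not None)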


Step 2: Review old configurations in RayLLM

The old deployment uses two YAML files: one for the Serve Config and one for the LLM Model Config. Below are examples of the previous configuration files.

  • Old Serve Config (serve_20250311213249.yaml):

applications:
- args:
    llm_configs:
    - ./model_config/Qwen--Qwen2_5-32B_20250311213249.yaml
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: true
  • Old LLM Model Config (Qwen--Qwen2_5-32B_20250311213249.yaml):

accelerator_type: A10
deployment_config:
  autoscaling_config:
    target_ongoing_requests: 32
  max_ongoing_requests: 64
engine_kwargs:
  max_num_batched_tokens: 8192
  max_num_seqs: 64
  tokenizer_pool_extra_config:
    runtime_env:
      pip: null
  tokenizer_pool_size: 2
  trust_remote_code: true
generation_config:
  prompt_format:
    use_hugging_face_chat_template: true
  stopping_sequences: []
  stopping_tokens: []
input_modality: text
llm_engine: VLLMMQEngine
lora_config: null
max_request_context_length: 8192
model_loading_config:
  model_id: Qwen/Qwen2.5-32B-Instruct
  model_source: Qwen/Qwen2.5-32B-Instruct
runtime_env:
  env_vars:
    HUGGING_FACE_HUB_TOKEN: <your_hf_token>
    VLLM_ALLOW_LONG_MAX_MODEL_LEN: '1'
tensor_parallelism:
  degree: 4

Step 3: Create the new LLM configuration file using the migration script

Update the migration script's config_path variable with the path of your old LLM model configuration file.


config_path = "Qwen--Qwen2_5-32B_20250311213249.yaml" ## replace with your old llm config file

Run the migration script to generate new_llm_config.yaml, shown below:

accelerator_type: A10G
deployment_config:
  autoscaling_config:
    target_ongoing_requests: 32
  max_ongoing_requests: 64
engine_kwargs:
  max_model_len: 8192
  max_num_batched_tokens: 8192
  max_num_seqs: 64
  tensor_parallel_size: 4
  tokenizer_pool_extra_config:
    runtime_env:
      pip: null
  tokenizer_pool_size: 2
  trust_remote_code: true
llm_engine: vLLM
lora_config: null
model_loading_config:
  model_id: Qwen/Qwen2.5-32B-Instruct
  model_source: Qwen/Qwen2.5-32B-Instruct
runtime_env:
  env_vars:
    HF_TOKEN: <your_hf_token>
    VLLM_ALLOW_LONG_MAX_MODEL_LEN: '1'

Here’s how each field transitions:

  • engine_kwargs and tensor_parallelism:
    • Action: Move degree: 4 from tensor_parallelism to tensor_parallel_size: 4 in engine_kwargs.
  • generation_config:
    • Action: Remove this field. If you need stopping tokens, pass extra_body={"stop_token_ids": [ids]} in the OpenAI client.
  • input_modality:
    • Action: Remove this field; it's derived automatically from the model.
  • llm_engine:
    • Action: Update the value to vLLM, the only supported engine.
  • max_request_context_length:
    • Action: Move max_request_context_length to max_model_len in engine_kwargs.
  • runtime_env:
    • Action: Rename the environment variable HUGGING_FACE_HUB_TOKEN to HF_TOKEN.
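
Optionally, you can sanity-check the generated file by loading it into the new API's LLMConfig object. This is a minimal sketch, assuming new_llm_config.yaml is in your working directory and ray[serve,llm] >= 2.44.0 is installed.

import yaml
from ray.serve.llm import LLMConfig

# Loading the migrated YAML into LLMConfig validates the field names and values.
with open("new_llm_config.yaml", "r") as f:
    llm_config = LLMConfig(**yaml.safe_load(f))

print(llm_config.model_loading_config)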

Step 4: Update the Serve config file

  1. Update the path to point to your new LLM model configuration file (./model_config/new_llm_config.yaml).
  2. Update the import_path to ray.serve.llm:build_openai_app.
applications:
- args:
    llm_configs:
    - ./model_config/new_llm_config.yaml  # Update to the new LLM config file path.
  import_path: ray.serve.llm:build_openai_app  # Updated to use ray.serve.llm
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: true

Step 5: Deploy the new LLM service

Finally, deploy the service using the Anyscale command:

anyscale service deploy -f serve_20250311213249.yaml

Note:

Alternatively, the new Ray Serve LLM API also enables you to deploy the LLM service directly from a Python script. For more details, see: https://docs.ray.io/en/latest/serve/llm/serving-llms.html
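
For example, here is a minimal sketch of a Python-based deployment that reuses the migrated config file. The model_config/new_llm_config.yaml path follows this guide's example layout; adjust it to your project.

import yaml
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Load the migrated config produced in Step 3.
with open("model_config/new_llm_config.yaml", "r") as f:
    llm_config = LLMConfig(**yaml.safe_load(f))

# Build the OpenAI-compatible app and run it with Ray Serve.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)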


Final remarks

This guide provides a structured pathway to transition your deployment from the RayLLM API to the new Ray Serve LLM API. Following these steps ensures that you transfer all key configuration fields accurately to the new API.

For further details and updates, see the Ray Serve LLM documentation: https://docs.ray.io/en/latest/serve/llm/serving-llms.html

Reach out to Anyscale customer support if you have any questions.