
Migration guide: Transitioning from RayLLM to Ray Serve LLM API

This guide explains how to migrate your LLM deployment from the legacy RayLLM to the new Ray Serve LLM API (ray[serve,llm] >= 2.44.0).

The overall deployment process remains the same; the primary change is that you need to update the LLM Model Config and Serve Config files to work with the new Ray Serve LLM API.

Migration summary table for LLMConfig

LLMConfig is the central configuration object that the LLM Model Config YAML file maps onto and that drives the deployment. LLMConfig changes substantially between RayLLM and the new Ray Serve LLM API.

The migration summary table below maps the common LLMConfig fields from RayLLM to the new Ray Serve LLM API and includes additional details for customizations you may have added.

Tip: Start by running the migration script provided below to convert your old LLM Model Config file to the new format. Then consult the migration summary table for additional details on specific configuration fields.


  • accelerator_type
    • Description: The type of accelerator used to run the model.
    • Migration action: Value updated.
    • Migration details: The new Ray Serve LLM API supports additional accelerator types; see the full list at https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html. Accelerator name updates: A10 changes to A10G, A100_40G changes to A100-40G, and A100_80G changes to A100-80G.
  • deployment_config
    • Description: The Ray Serve deployment settings for the model deployment.
    • Migration action: No change.
  • engine_kwargs
    • Description: Additional keyword arguments for the engine.
    • Migration action: New fields added.
    • Migration details: engine_kwargs now supports the tensor_parallel_size and max_model_len fields.
    • Additional notes: For the full list of engine_kwargs, see https://docs.vllm.ai/en/latest/serving/engine_args.html.
  • generation_config
    • Description: Settings for prompt formatting and token interpretation.
    • Migration action: Removed.
    • Migration details: generation_config and prompt_format are no longer used. The new Ray Serve LLM API imports configurations such as prompt_format directly from Hugging Face. Models from the Hugging Face ecosystem use the chat templating pattern outlined at https://huggingface.co/docs/transformers/main/en/chat_templating.
    • Additional notes: If you previously used stopping_sequences or stopping_tokens, use the vLLM sampling parameter stop_token_ids, which defines tokens that halt generation when they appear (see the vLLM Sampling Params documentation). You can pass extra_body={"stop_token_ids": stop_token_ids} in the OpenAI chat client to use these sampling parameters with vLLM; see the client sketch after this table.
  • input_modality
    • Description: Defines the type of request that the model accepts.
    • Migration action: Removed.
    • Migration details: Input modality is now derived automatically from the model.
  • json_mode
    • Description: Configuration for JSON mode.
    • Migration action: Removed.
    • Migration details: JSON mode is enabled automatically in the new Ray Serve LLM deployment.
    • Additional notes: Set response_format={"type": "json_object"} in the OpenAI chat client. You can also pass extra_body={"guided_json": json_schema} to supply a JSON schema; see the client sketch after this table. For the structured outputs that vLLM supports, see https://docs.vllm.ai/en/latest/features/structured_outputs.html.
  • llm_engine
    • Description: The LLM engine used to run the model.
    • Migration action: Kept, but restricted to vLLM.
    • Migration details: This field is optional; if provided, only the value vLLM is accepted.
  • lora_config
    • Description: Settings for LoRA adapters.
    • Migration action: No change to the configuration itself.
    • Migration details: LoRA weight storage path naming is simplified while remaining compatible with previous paths. Previously, RayLLM stored LoRA weights in paths that combined the base model ID with a custom suffix (for example, s3://dynamic_path/[base_model]:[suffix]). Now each LoRA weight has a unique ID directly under a common directory (for example, s3://dynamic_path/[lora_id]). This removes the dependency on the base model name, making storage more flexible and easier to manage.
    • Additional notes: See the multi-LoRA deployment example for the new LLM API: https://docs.ray.io/en/latest/serve/llm/serving-llms.html#multi-lora-deployment
  • max_request_context_length
    • Description: Maximum tokens (input + generated) per request. Must be ≤ the model context length.
    • Migration action: Refactored.
    • Migration details: Use the max_model_len field in engine_kwargs instead.
    • Additional notes: Ray Serve LLM passes max_model_len in engine_kwargs directly to the vLLM engine; see https://docs.vllm.ai/en/latest/serving/engine_args.html.
  • model_loading_config
    • Description: Configuration for downloading and exposing the model.
    • Migration action: No change.
  • runtime_env
    • Description: Runtime environment settings for deployment replicas and engine workers.
    • Migration action: Environment variable renamed.
    • Migration details: Replace HUGGING_FACE_HUB_TOKEN with HF_TOKEN.
    • Additional notes: As of Ray 2.43.0, both HUGGING_FACE_HUB_TOKEN and HF_TOKEN are supported, but HUGGING_FACE_HUB_TOKEN will be deprecated soon.
  • tensor_parallelism
    • Description: Tensor parallelism settings for the model.
    • Migration action: Refactored.
    • Migration details: Move tensor_parallelism.degree to the tensor_parallel_size field in engine_kwargs.
    • Additional notes: Ray Serve LLM passes tensor_parallel_size in engine_kwargs directly to the vLLM engine; see https://docs.vllm.ai/en/latest/serving/engine_args.html.
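
For the generation_config and json_mode rows above, here is a minimal client-side sketch using the OpenAI Python client. The base URL, API key, stop token ID, and JSON schema are placeholders for illustration; substitute the values for your own deployment and model.

from openai import OpenAI

# Placeholder endpoint and token; replace with your deployed service's URL and auth token.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE_KEY")

# Stopping tokens: pass vLLM's stop_token_ids sampling parameter through extra_body.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "List three facts about llamas."}],
    extra_body={"stop_token_ids": [151645]},  # Placeholder IDs; use your model's stop tokens.
)
print(response.choices[0].message.content)

# JSON mode: request a JSON object and, optionally, constrain it with a guided_json schema.
json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    response_format={"type": "json_object"},
    extra_body={"guided_json": json_schema},
)
print(response.choices[0].message.content)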

Migration script for LLM model config

The following Python script automates most of the migration process. It reads your old LLM Model Config YAML file, applies the necessary changes, and writes out the new configuration file: new_llm_config.yaml.

import yaml
from pprint import pprint

config_path = "old_llm_config.yaml"  # Replace with your old LLM config file.

with open(config_path, "r") as f:
    config_dict = yaml.safe_load(f)

# Define the whitelist of supported parameters for LLMConfig.
allowed_keys = {
    "accelerator_type",
    "deployment_config",
    "engine_kwargs",
    "llm_engine",
    "lora_config",
    "model_loading_config",
    "runtime_env",
}

# Filter the loaded config to only include allowed keys.
new_config = {key: value for key, value in config_dict.items() if key in allowed_keys}

# Update runtime_env to rename HUGGING_FACE_HUB_TOKEN to HF_TOKEN.
if "runtime_env" in new_config:
    env_vars = new_config["runtime_env"].get("env_vars", {})
    if "HUGGING_FACE_HUB_TOKEN" in env_vars:
        env_vars["HF_TOKEN"] = env_vars.pop("HUGGING_FACE_HUB_TOKEN")
    new_config["runtime_env"]["env_vars"] = env_vars

# Update accelerator_type if necessary using a mapping.
accelerator_mapping = {
    "A10": "A10G",
    "A100_40G": "A100-40G",
    "A100_80G": "A100-80G",
}
if "accelerator_type" in new_config:
    current = new_config["accelerator_type"]
    if current in accelerator_mapping:
        new_config["accelerator_type"] = accelerator_mapping[current]

# Refactor max_request_context_length to max_model_len inside engine_kwargs.
if "max_request_context_length" in config_dict:
    # Ensure engine_kwargs exists.
    if "engine_kwargs" not in new_config or new_config["engine_kwargs"] is None:
        new_config["engine_kwargs"] = {}
    new_config["engine_kwargs"]["max_model_len"] = config_dict["max_request_context_length"]

# If tensor_parallelism is specified in the YAML,
# add its "degree" value to engine_kwargs as tensor_parallel_size.
if "tensor_parallelism" in config_dict:
    degree = config_dict["tensor_parallelism"].get("degree", 1)
    # Ensure engine_kwargs exists.
    if "engine_kwargs" not in new_config or new_config["engine_kwargs"] is None:
        new_config["engine_kwargs"] = {}
    new_config["engine_kwargs"]["tensor_parallel_size"] = degree

# Force llm_engine to "vLLM" (the only supported engine) regardless of the YAML content.
new_config["llm_engine"] = "vLLM"

print("The new LLM configuration is:")
pprint(new_config)

# Save the new configuration to a YAML file.
output_path = "new_llm_config.yaml"
with open(output_path, "w") as outfile:
    yaml.safe_dump(new_config, outfile)

print(f"New configuration saved to {output_path}")

Step-by-step migration guide


Step 1: Verify prerequisites and install necessary packages

Before migrating, ensure your environment meets the requirements for the new Ray Serve LLM API (ray[serve,llm] >= 2.44.0).

Alternatively, you can use a Docker image such as anyscale/ray-llm:2.44.1-py311-cu124, which has all the necessary packages preinstalled. Check Anyscale for the newest ray-llm Docker image.

For more details, see:

https://docs.ray.io/en/latest/serve/llm/serving-llms.html
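
A quick way to confirm the environment meets this requirement is a short Python check. This is a minimal sketch; it assumes you run it inside the environment (or Docker image) you plan to deploy from.

import ray
from ray.serve.llm import LLMConfig  # This import fails if the llm extra is missing.

# The new API ships with ray[serve,llm] >= 2.44.0.
print("Ray version:", ray.__version__)
print("LLMConfig available:", LLMConfig is not None)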


Step 2: Review old configurations in RayLLM

The old deployment uses two YAML files: one for the Serve Config and one for the LLM Model Config. Below are examples of the previous configuration files.

  • Old Serve Config (serve_20250311213249.yaml):

applications:
- args:
    llm_configs:
    - ./model_config/Qwen--Qwen2_5-32B_20250311213249.yaml
  import_path: rayllm:app
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: true
  • Old LLM Model Config (Qwen--Qwen2_5-32B_20250311213249.yaml):

accelerator_type: A10
deployment_config:
  autoscaling_config:
    target_ongoing_requests: 32
  max_ongoing_requests: 64
engine_kwargs:
  max_num_batched_tokens: 8192
  max_num_seqs: 64
  tokenizer_pool_extra_config:
    runtime_env:
      pip: null
  tokenizer_pool_size: 2
  trust_remote_code: true
generation_config:
  prompt_format:
    use_hugging_face_chat_template: true
  stopping_sequences: []
  stopping_tokens: []
input_modality: text
llm_engine: VLLMMQEngine
lora_config: null
max_request_context_length: 8192
model_loading_config:
  model_id: Qwen/Qwen2.5-32B-Instruct
  model_source: Qwen/Qwen2.5-32B-Instruct
runtime_env:
  env_vars:
    HUGGING_FACE_HUB_TOKEN: <your_hf_token>
    VLLM_ALLOW_LONG_MAX_MODEL_LEN: '1'
tensor_parallelism:
  degree: 4

Step 3: Create the new LLM configuration file using the migration script

Update the migration script's config_path variable with the path of your old LLM model configuration file.


config_path = "Qwen--Qwen2_5-32B_20250311213249.yaml" ## replace with your old llm config file

Run the migration script to generate new_llm_config.yaml, shown below:

accelerator_type: A10G
deployment_config:
  autoscaling_config:
    target_ongoing_requests: 32
  max_ongoing_requests: 64
engine_kwargs:
  max_model_len: 8192
  max_num_batched_tokens: 8192
  max_num_seqs: 64
  tensor_parallel_size: 4
  tokenizer_pool_extra_config:
    runtime_env:
      pip: null
  tokenizer_pool_size: 2
  trust_remote_code: true
llm_engine: vLLM
lora_config: null
model_loading_config:
  model_id: Qwen/Qwen2.5-32B-Instruct
  model_source: Qwen/Qwen2.5-32B-Instruct
runtime_env:
  env_vars:
    HF_TOKEN: <your_hf_token>
    VLLM_ALLOW_LONG_MAX_MODEL_LEN: '1'

Here’s how each field transitions:

  • engine_kwargs and tensor_parallelism:
    • Action: Move degree: 4 from tensor_parallelism to tensor_parallel_size: 4 in engine_kwargs.
  • generation_config:
    • Action: Remove this field. If you need stopping tokens, pass extra_body={"stop_token_ids": [ids]} in the OpenAI client.
  • input_modality:
    • Action: Remove this field; it's derived automatically from the model.
  • llm_engine:
    • Action: Update the value to vLLM, the only supported engine.
  • max_request_context_length:
    • Action: Move max_request_context_length to max_model_len in engine_kwargs.
  • runtime_env:
    • Action: Rename the environment variable HUGGING_FACE_HUB_TOKEN to HF_TOKEN.
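
Optionally, you can sanity-check the generated file by loading it into the new API's LLMConfig object. This is a minimal sketch, assuming new_llm_config.yaml is in your working directory and ray[serve,llm] >= 2.44.0 is installed.

import yaml
from ray.serve.llm import LLMConfig

# Loading the migrated YAML into LLMConfig validates the field names and values.
with open("new_llm_config.yaml", "r") as f:
    llm_config = LLMConfig(**yaml.safe_load(f))

print(llm_config.model_loading_config)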

Step 4: Update the Serve config file

  1. Update the path to point to your new LLM model configuration file (./model_config/new_llm_config.yaml).
  2. Update the import_path to ray.serve.llm:build_openai_app.
applications:
- args:
    llm_configs:
    - ./model_config/new_llm_config.yaml  # Update to the new LLM config file path.
  import_path: ray.serve.llm:build_openai_app  # Updated to use ray.serve.llm
  name: llm-endpoint
  route_prefix: /
query_auth_token_enabled: true

Step 5: Deploy the new LLM service

Finally, deploy the service using the Anyscale command:

anyscale service deploy -f serve_20250311213249.yaml

Note:

Alternatively, the new Ray Serve LLM API also enables you to deploy the LLM service directly from a Python script. For more details, see: https://docs.ray.io/en/latest/serve/llm/serving-llms.html
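
For example, here is a minimal sketch of a Python-based deployment that reuses the migrated config file. The model_config/new_llm_config.yaml path follows this guide's example layout; adjust it to your project.

import yaml
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Load the migrated config produced in Step 3.
with open("model_config/new_llm_config.yaml", "r") as f:
    llm_config = LLMConfig(**yaml.safe_load(f))

# Build the OpenAI-compatible app and run it with Ray Serve.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)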


Final remarks

This guide provides a structured pathway to transition your deployment from the RayLLM API to the new Ray Serve LLM API. Following these steps ensures that you transfer all key configuration fields accurately to the new API.

For further details and updates, see the Ray Serve LLM documentation: https://docs.ray.io/en/latest/serve/llm/serving-llms.html

Reach out to Anyscale customer support if you have any questions.