
LLMApp Configs API

This document describes the API for the RayLLM application model.

LLMConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| accelerator_type | str | N/A | The type of accelerator to run the model on. Only the following values are supported: `A10`, `L4`, `A100_40G`, `A100_80G`, `H100`. |
| deployment_config | DeploymentConfig | N/A | The Ray Serve deployment settings for the model deployment. |
| engine_kwargs | dict[str, Any] | | Additional keyword arguments for the engine. For vLLM, this includes all of the configuration knobs it provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configs. |
| generation_config | GenerationConfig | N/A | The settings for how to adjust the prompt and interpret tokens. |
| input_modality | InputModality | InputModality.text | The type of request that can be submitted to this model. |
| json_mode | JSONModeConfig | N/A | Settings for JSON mode. |
| llm_engine | LLMEngine | N/A | The LLMEngine that should be used to run the model. |
| lora_config | Optional[LoraConfig] | None | Settings for LoRA adapters. |
| max_request_context_length | int | 2048 | The maximum number of tokens (input + generated) the model can handle per request. This must be less than or equal to the model context length, which can be obtained from the model card. Requests with more tokens than this are rejected. |
| model_loading_config | ModelLoadingConfig | N/A | The settings for how to download and expose the model. |
| runtime_env | Optional[dict[str, Any]] | None | The runtime_env to use for the model deployment replica and the engine workers. |
| tensor_parallelism | TensorParallelismConfig | See model defaults | The tensor parallelism settings for the model. |
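
Taken together, these fields form the top level of a RayLLM config file. The following is a minimal hypothetical sketch; the model ID, accelerator choice, and all values are illustrative assumptions, not recommendations:

```yaml
# Hypothetical LLMConfig sketch -- all values are illustrative assumptions.
accelerator_type: A10
llm_engine: VLLMEngine
input_modality: text
max_request_context_length: 2048
model_loading_config:
  model_id: my-org/my-model          # ID end users will query (placeholder)
  model_source: my-org/my-model      # HuggingFace model ID (placeholder)
tensor_parallelism:
  degree: 1                          # 1 = no tensor parallelism
engine_kwargs:
  max_num_seqs: 64                   # example vLLM knob (assumed)
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2
```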

ModelLoadingConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| anytensor_config | Optional[AnytensorConfig] | None | Configuration to use Anytensor for improved model loading speed. Only the model weights will be loaded using Anytensor; the tokenizer and extra files will still be pulled from HuggingFace or the S3/GCS mirror. |
| model_id | str | N/A | The ID that should be used by end users to access this model. |
| model_source | Union[str, S3MirrorConfig, GCSMirrorConfig] | N/A | Where to obtain the model weights from. Should be a HuggingFace model ID, S3 mirror config, or GCS mirror config. |
| tokenizer_source | Optional[str] | None | Where to obtain the tokenizer from. If None, the tokenizer is obtained from the model source. Only HuggingFace IDs are supported for now. |
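
As a sketch, a model could be served from an S3 mirror while pulling its tokenizer from HuggingFace; the bucket and IDs below are placeholders:

```yaml
model_loading_config:
  model_id: my-org/my-model                  # placeholder user-facing ID
  model_source:
    bucket_uri: s3://my-bucket/my-model/     # placeholder S3MirrorConfig
  tokenizer_source: my-org/my-model          # HuggingFace ID (placeholder)
```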

S3MirrorConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| bucket_uri | Optional[str] | None | |
| extra_files | list[ExtraFiles] | [] | |
| s3_aws_credentials | Optional[S3AWSCredentials] | None | |
| s3_sync_args | Optional[list[str]] | None | |

ExtraFiles

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| bucket_uri | str | N/A | |
| destination_path | str | N/A | |

S3AWSCredentials

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| auth_token_env_variable | Optional[str] | None | |
| create_aws_credentials_url | str | N/A | |
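
Combining the three S3-related schemas above, a hedged sketch of a mirror with temporary credentials and an extra file might look like this; the bucket, URL, paths, and environment variable name are invented placeholders:

```yaml
model_source:
  bucket_uri: s3://my-bucket/my-model/
  extra_files:
    - bucket_uri: s3://my-bucket/extras/
      destination_path: /tmp/extras            # placeholder destination
  s3_aws_credentials:
    create_aws_credentials_url: https://example.com/credentials
    auth_token_env_variable: MY_AUTH_TOKEN     # placeholder env var name
  s3_sync_args:
    - "--only-show-errors"                     # example aws s3 sync flag
```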

GCSMirrorConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| bucket_uri | Optional[str] | None | |
| extra_files | list[ExtraFiles] | [] | |
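
A GCS mirror is configured analogously; the bucket is a placeholder:

```yaml
model_source:
  bucket_uri: gs://my-bucket/my-model/   # placeholder GCS bucket
  extra_files: []
```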

AnytensorConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | str | N/A | |

GenerationConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| generate_kwargs | dict[str, Any] | | Extra generation kwargs that need to be passed into the sampling stage for the deployment (this includes things like temperature, etc.). |
| prompt_format | Optional[Union[PromptFormat, VisionPromptFormat]] | None | Handles chat template formatting and tokenization. If None, prompt formatting is disabled and the model can only be queried in completion mode. |
| stopping_sequences | Optional[list[Union[str, int, list[Union[str, int]]]]] | None | Stopping sequences to propagate for inference. By default, EOS/UNK tokens are used at inference. |
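
For example, sampling defaults and stop sequences could be set like this; the values are illustrative, not recommendations:

```yaml
generation_config:
  generate_kwargs:
    temperature: 0.7          # example sampling kwargs (assumed)
    top_p: 0.9
  stopping_sequences: ["<unk>"]   # placeholder stop sequence
```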

PromptFormat

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
| assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
| bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
| default_system_message | str | "" | The default system message that is included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
| strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
| system | str | N/A | The template for the system message. It should include the {instruction} template. |
| system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
| system_in_user | bool | False | If True, the system message will be included in the user message. |
| tool | str | | The special role whose content captures the output of the called functions. It should include the {instruction} template. |
| tool_calls | str | | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
| tools_list | str | | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
| tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
| tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
| trailing_assistant | str | | (Inference only) The string appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
| user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is True, it should also include the {system} template. If tools_list_in_user is True, it should also include the {tools_list} template. |
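
As an illustration, a Llama-2-style chat template could be expressed with these fields. The template strings below are a rough approximation of that format, written by hand for this sketch; they are not an official or verified prompt format:

```yaml
prompt_format:
  bos: "<s>"                        # prepended before the full prompt
  system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
  user: "[INST] {system}{instruction} [/INST]"
  assistant: " {instruction} </s><s>"
  trailing_assistant: ""
  system_in_user: true              # system block is embedded in the user turn
  default_system_message: ""
```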

VisionPromptFormat

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
| assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
| bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
| default_system_message | str | "" | The default system message that is included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
| strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
| system | str | N/A | The template for the system message. It should include the {instruction} template. |
| system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
| system_in_user | bool | False | If True, the system message will be included in the user message. |
| tool | str | | The special role whose content captures the output of the called functions. It should include the {instruction} template. |
| tool_calls | str | | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
| tools_list | str | | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
| tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
| tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
| trailing_assistant | str | | (Inference only) The string appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
| user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is True, it should also include the {system} template. If tools_list_in_user is True, it should also include the {tools_list} template. |
| vision | bool | True | |

LLMEngine

| Enum Name | Value |
| --- | --- |
| VLLM | VLLMEngine |

InputModality

| Enum Name | Value |
| --- | --- |
| text | text |
| image | image |

TensorParallelismConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| degree | int | 1 | The degree of tensor parallelism. Must be greater than or equal to 1. When set to 1, the model does not use tensor parallelism. |

JSONModeConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Whether JSON mode should be enabled on this model. |
| options | Optional[JSONModeOptions] | None | Extra options to configure JSON mode behavior. |

JSONModeOptions

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| num_processes | int | 32 | The number of background processes for each replica. |
| recreate_failed_actors | bool | True | Whether to restart failed JSON mode actors. |
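
Enabling JSON mode with explicit options might look like the following sketch; the option values simply mirror the defaults listed above:

```yaml
json_mode:
  enabled: true
  options:
    num_processes: 32
    recreate_failed_actors: true
```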

LoraConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| download_timeout_s | Optional[float] | 30.0 | How much time the download subprocess has to download a single LoRA before a timeout. None means no timeout. |
| dynamic_lora_loading_path | Optional[str] | None | Cloud storage path where LoRA adapter weights are stored. |
| max_download_tries | int | 3 | The maximum number of download retries. |
| max_num_adapters_per_replica | int | 16 | The maximum number of adapters to load on each replica. |
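
A hedged sketch of a LoRA setup, with a placeholder storage path and the default limits spelled out explicitly:

```yaml
lora_config:
  dynamic_lora_loading_path: s3://my-bucket/lora-adapters/   # placeholder path
  max_num_adapters_per_replica: 16
  download_timeout_s: 30.0
  max_download_tries: 3
```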

DeploymentConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| autoscaling_config | Optional[AutoscalingConfig] | See model defaults | Configuration for autoscaling the number of workers. |
| graceful_shutdown_timeout_s | int | 300 | How long the controller waits before forcefully killing a replica during shutdown, in seconds. |
| max_concurrent_queries | Optional[int] | None | This field is deprecated. max_ongoing_requests should be used instead. |
| max_ongoing_requests | Optional[int] | None | Sets the maximum number of queries in flight that are sent to a single replica. |

AutoscalingConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| downscale_delay_s | float | 300.0 | How long to wait before scaling down replicas, in seconds. |
| initial_replicas | int | 1 | The number of replicas that are started initially for the deployment. |
| look_back_period_s | float | 30.0 | Time window to average metrics over, in seconds. |
| max_replicas | int | 100 | The maximum number of replicas for the deployment. |
| metrics_interval_s | float | 10.0 | How often to scrape for metrics, in seconds. |
| min_replicas | int | 1 | The minimum number of replicas for the deployment. |
| target_num_ongoing_requests_per_replica | Optional[int] | None | This field is deprecated. If it is set, target_ongoing_requests is set to the same value. If neither field is set, _DEFAULT_TARGET_ONGOING_REQUESTS is used. |
| target_ongoing_requests | Optional[int] | None | The maximum number of queries that are sent to a replica of this deployment without receiving a response. |
| upscale_delay_s | float | 10.0 | How long to wait before scaling up replicas, in seconds. |
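
Putting the two schemas together, an autoscaling deployment could be sketched as follows; the replica counts and request targets are illustrative assumptions, not tuned values:

```yaml
deployment_config:
  max_ongoing_requests: 64            # illustrative value
  graceful_shutdown_timeout_s: 300
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8                   # illustrative value
    target_ongoing_requests: 16       # illustrative value
    upscale_delay_s: 10.0
    downscale_delay_s: 300.0
```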