# LLMApp Configs API
This document describes the API for the RayLLM application model.
## LLMConfig
Name | Type | Default | Description |
---|---|---|---|
accelerator_type | str | N/A | The type of accelerator to run the model on. Only the following values are supported: ['A10', 'L4', 'A100_40G', 'A100_80G', 'H100'] |
deployment_config | DeploymentConfig | N/A | The Ray Serve deployment settings for the model deployment. |
engine_kwargs | dict[str, Any] | {} | Additional keyword arguments for the engine. For vLLM, this includes all of the configuration knobs it provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configs. |
generation_config | GenerationConfig | N/A | The settings for how to adjust the prompt and interpret tokens. |
input_modality | InputModality | InputModality.text | The type of request that can be submitted to this model. |
json_mode | JSONModeConfig | N/A | Settings for JSON mode. |
llm_engine | LLMEngine | N/A | The LLMEngine that should be used to run the model. |
lora_config | Optional[LoraConfig] | None | Settings for LoRA adapter. |
max_request_context_length | int | 2048 | The maximum number of tokens (input + generated) the model can handle per request. This must be less than or equal to the model context length, which can be obtained from the model card. Requests with more tokens than this will be rejected. |
model_loading_config | ModelLoadingConfig | N/A | The settings for how to download and expose the model. |
runtime_env | Optional[dict[str, Any]] | None | The runtime_env to use for the model deployment replica and the engine workers. |
tensor_parallelism | TensorParallelismConfig | See model defaults | The tensor parallelism settings for the model. |
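For illustration, a minimal LLMConfig might look like the following YAML sketch. The model ID, accelerator type, and all values shown are examples rather than defaults:

```yaml
model_loading_config:
  model_id: meta-llama/Llama-2-7b-chat-hf      # the ID end users query
  model_source: meta-llama/Llama-2-7b-chat-hf  # HuggingFace model ID
llm_engine: VLLMEngine
accelerator_type: A10
tensor_parallelism:
  degree: 1                       # 1 disables tensor parallelism
max_request_context_length: 4096
engine_kwargs:
  trust_remote_code: true         # passed through to vLLM
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 4
generation_config:
  generate_kwargs:
    temperature: 0.7
  stopping_sequences: []
```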
## ModelLoadingConfig
Name | Type | Default | Description |
---|---|---|---|
anytensor_config | Optional[AnytensorConfig] | None | Configuration to use Anytensor for improved model loading speed. Only the model weights will be loaded using Anytensor; the tokenizer and extra files will still be pulled from HuggingFace or the S3/GCS mirror. |
model_id | str | N/A | The ID that should be used by end users to access this model. |
model_source | Union[str, S3MirrorConfig, GCSMirrorConfig] | N/A | Where to obtain the model weights from. Should be a HuggingFace model ID, S3 mirror config, or GCS mirror config. |
tokenizer_source | Optional[str] | None | Where to obtain the tokenizer from. If None, tokenizer is obtained from the model source. Only HuggingFace IDs are supported for now. |
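For example, a ModelLoadingConfig that pulls both the model and the tokenizer from HuggingFace could be sketched as follows (the IDs are illustrative):

```yaml
model_loading_config:
  model_id: my-org/my-model                             # hypothetical ID exposed to end users
  model_source: mistralai/Mistral-7B-Instruct-v0.1      # HuggingFace model ID
  tokenizer_source: mistralai/Mistral-7B-Instruct-v0.1  # optional; defaults to the model source
```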
## S3MirrorConfig
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | Optional[str] | None | The S3 bucket URI to download the model files from. |
extra_files | list[ExtraFiles] | [] | Additional files to download from S3 alongside the model. |
s3_aws_credentials | Optional[S3AWSCredentials] | None | Credentials used to access the S3 bucket. |
s3_sync_args | Optional[list[str]] | None | Additional arguments passed to the command that syncs the files from S3. |
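A model_source given as an S3 mirror config (a field of ModelLoadingConfig above) might look like this sketch; the bucket, sync argument, and extra file are hypothetical:

```yaml
model_source:
  bucket_uri: s3://my-bucket/path/to/model   # hypothetical S3 location of the weights
  s3_sync_args: ["--no-sign-request"]        # example extra argument for the sync command
  extra_files:
    - bucket_uri: s3://my-bucket/path/to/extras/config.json
      destination_path: config.json          # where the file is written locally
```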
## ExtraFiles
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | str | N/A | The S3 or GCS URI of the file to download. |
destination_path | str | N/A | The local path to write the downloaded file to. |
## S3AWSCredentials
Name | Type | Default | Description |
---|---|---|---|
auth_token_env_variable | Optional[str] | None | The name of the environment variable containing the token passed along when requesting credentials. |
create_aws_credentials_url | str | N/A | The URL to query to obtain temporary AWS credentials. |
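For instance, credentials fetched from a hypothetical internal endpoint could be configured like this sketch:

```yaml
s3_aws_credentials:
  create_aws_credentials_url: https://example.com/aws-credentials  # hypothetical endpoint returning temporary credentials
  auth_token_env_variable: MY_CREDENTIALS_TOKEN                    # env variable holding the token sent to that endpoint
```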
## GCSMirrorConfig
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | Optional[str] | None | The GCS bucket URI to download the model files from. |
extra_files | list[ExtraFiles] | [] | Additional files to download from GCS alongside the model. |
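A GCS mirror is configured analogously to the S3 mirror; a sketch with a hypothetical bucket:

```yaml
model_source:
  bucket_uri: gs://my-bucket/path/to/model   # hypothetical GCS location of the weights
  extra_files: []
```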
## AnytensorConfig
Name | Type | Default | Description |
---|---|---|---|
model_path | str | N/A | The storage path Anytensor loads the model weights from. |
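Assuming the weights have already been prepared for Anytensor at some storage path (the path below is hypothetical), the config is a single field:

```yaml
anytensor_config:
  model_path: s3://my-bucket/anytensor/my-model   # hypothetical path to the Anytensor weights
```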
## GenerationConfig
Name | Type | Default | Description |
---|---|---|---|
generate_kwargs | dict[str, Any] | {} | Extra generation kwargs that need to be passed into the sampling stage for the deployment (this includes things like temperature, etc.). |
prompt_format | Optional[Union[PromptFormat, VisionPromptFormat]] | None | Handles chat template formatting and tokenization. If None, prompt formatting is disabled and the model can only be queried in completion mode. |
stopping_sequences | Optional[list[Union[str, int, list[Union[str, int]]]]] | None | Stopping sequences to propagate for inference. By default, we use EOS/UNK tokens at inference. |
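As an example, a GenerationConfig that sets sampling defaults and a custom stop sequence might look like this sketch (values are illustrative):

```yaml
generation_config:
  generate_kwargs:
    temperature: 0.7        # example sampling parameters passed to the engine
    top_p: 0.95
  stopping_sequences: ["<unk>"]
  # prompt_format omitted here; without it the model can only be queried in completion mode
```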
## PromptFormat
Name | Type | Default | Description |
---|---|---|---|
add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
default_system_message | str | "" | The default system message that should be included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
system | str | N/A | The template for the system message. It should include the {instruction} template. |
system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
system_in_user | bool | False | If True, the system message will be included in the user message. |
tool | str | "" | The template for the special role whose content captures the output of the called functions. It should include the {instruction} template. |
tool_calls | str | "" | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
tools_list | str | "" | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
trailing_assistant | str | "" | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. If tools_list_in_user is set to True, it should also include the {tools_list} template. |
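As an illustration, a Llama-2-chat-style prompt_format could be sketched as follows; the template strings are examples, not a canonical format:

```yaml
prompt_format:
  system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
  user: "[INST] {system}{instruction} [/INST]"  # {system} appears here because system_in_user is true
  assistant: " {instruction} </s><s>"
  trailing_assistant: ""
  system_in_user: true
  default_system_message: ""
```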
## VisionPromptFormat
Name | Type | Default | Description |
---|---|---|---|
add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
default_system_message | str | "" | The default system message that should be included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
system | str | N/A | The template for the system message. It should include the {instruction} template. |
system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
system_in_user | bool | False | If True, the system message will be included in the user message. |
tool | str | "" | The template for the special role whose content captures the output of the called functions. It should include the {instruction} template. |
tool_calls | str | "" | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
tools_list | str | "" | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
trailing_assistant | str | "" | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. If tools_list_in_user is set to True, it should also include the {tools_list} template. |
vision | bool | True | Whether this prompt format accepts image inputs. |
## LLMEngine
Enum Name | Value |
---|---|
VLLM | VLLMEngine |
## InputModality
Enum Name | Value |
---|---|
text | text |
image | image |
## TensorParallelismConfig
Name | Type | Default | Description |
---|---|---|---|
degree | int | 1 | The degree of tensor parallelism. Must be greater than or equal to 1. When set to 1, the model does not use tensor parallelism. |
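For example, setting the degree to 2 shards the model across two GPUs:

```yaml
tensor_parallelism:
  degree: 2   # shard the model weights across 2 GPUs
```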
## JSONModeConfig
Name | Type | Default | Description |
---|---|---|---|
enabled | bool | False | Whether JSON mode should be enabled on this model. |
options | Optional[JSONModeOptions] | None | Extra options to configure JSON mode behavior. |
## JSONModeOptions
Name | Type | Default | Description |
---|---|---|---|
num_processes | int | 32 | The number of background processes for each replica. |
recreate_failed_actors | bool | True | Whether to restart failed JSON mode actors. |
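Putting the two tables above together, JSON mode could be enabled like this sketch (the option values are illustrative):

```yaml
json_mode:
  enabled: true
  options:
    num_processes: 16             # background processes per replica
    recreate_failed_actors: true  # restart failed JSON mode actors
```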
## LoraConfig
Name | Type | Default | Description |
---|---|---|---|
download_timeout_s | Optional[float] | 30.0 | How much time the download subprocess has to download a single LoRA before a timeout. None means no timeout. |
dynamic_lora_loading_path | Optional[str] | None | Cloud storage path where LoRA adapter weights are stored. |
max_download_tries | int | 3 | The maximum number of download retries. |
max_num_adapters_per_replica | int | 16 | The maximum number of adapters that can be loaded on each replica. |
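A LoraConfig sketch with a hypothetical adapter path:

```yaml
lora_config:
  dynamic_lora_loading_path: s3://my-bucket/lora-adapters  # hypothetical cloud storage path
  max_num_adapters_per_replica: 8
  download_timeout_s: 60.0
  max_download_tries: 3
```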
## DeploymentConfig
Name | Type | Default | Description |
---|---|---|---|
autoscaling_config | Optional[AutoscalingConfig] | See model defaults | Configuration for autoscaling the number of replicas. |
graceful_shutdown_timeout_s | int | 300 | The duration, in seconds, that the controller waits before forcefully killing a replica during shutdown. |
max_concurrent_queries | Optional[int] | None | This field is deprecated. max_ongoing_requests should be used instead. |
max_ongoing_requests | Optional[int] | None | Sets the maximum number of queries in flight that are sent to a single replica. |
## AutoscalingConfig
Name | Type | Default | Description |
---|---|---|---|
downscale_delay_s | float | 300.0 | How long to wait before scaling down replicas, in seconds. |
initial_replicas | int | 1 | The number of replicas that are started initially for the deployment. |
look_back_period_s | float | 30.0 | Time window to average over for metrics, in seconds. |
max_replicas | int | 100 | The maximum number of replicas for the deployment. |
metrics_interval_s | float | 10.0 | How often to scrape for metrics in seconds. |
min_replicas | int | 1 | The minimum number of replicas for the deployment. |
target_num_ongoing_requests_per_replica | Optional[int] | None | This field is deprecated. If it is set, target_ongoing_requests will be set to the same value. If neither field is set, _DEFAULT_TARGET_ONGOING_REQUESTS is used. |
target_ongoing_requests | Optional[int] | None | The target number of ongoing requests per replica that the autoscaler tries to maintain. |
upscale_delay_s | float | 10.0 | How long to wait before scaling up replicas, in seconds. |
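Combining DeploymentConfig and AutoscalingConfig, a deployment sketch might look like the following (all values are illustrative):

```yaml
deployment_config:
  max_ongoing_requests: 64        # cap on in-flight requests per replica
  graceful_shutdown_timeout_s: 300
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_ongoing_requests: 16   # the autoscaler tries to maintain this many ongoing requests per replica
    upscale_delay_s: 10.0
    downscale_delay_s: 300.0
```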