# LLMApp Configs API
This document describes the API for the RayLLM application model.
## LLMConfig
Name | Type | Default | Description |
---|---|---|---|
accelerator_type | str | N/A | The type of accelerator to run the model on. Only the following values are supported: ['A10', 'L4', 'A100_40G', 'A100_80G', 'H100'] |
deployment_config | DeploymentConfig | N/A | The Ray Serve deployment settings for the model deployment. |
engine_kwargs | dict[str, Any] | {} | Additional keyword arguments for the engine. For vLLM, this includes all of the configuration knobs it provides out of the box, except for tensor parallelism, which is set automatically from the Ray Serve configs. |
generation_config | GenerationConfig | N/A | The settings for how to adjust the prompt and interpret tokens. |
input_modality | InputModality | InputModality.text | The type of request that can be submitted to this model. |
json_mode | JSONModeConfig | N/A | Settings for JSON mode. |
llm_engine | LLMEngine | N/A | The LLMEngine that should be used to run the model. |
lora_config | Optional[LoraConfig] | None | Settings for LoRA adapter. |
max_request_context_length | int | 2048 | The maximum number of tokens (input + generated) the model can handle per request. This must be less than or equal to the model context length, which can be obtained from the model card. Requests with more tokens than this will be rejected. |
model_loading_config | ModelLoadingConfig | N/A | The settings for how to download and expose the model. |
runtime_env | Optional[dict[str, Any]] | None | The runtime_env to use for the model deployment replica and the engine workers. |
tensor_parallelism | TensorParallelismConfig | See model defaults | The tensor parallelism settings for the model. |
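For illustration, a minimal LLMConfig might look like the following YAML sketch. The model ID, accelerator type, and all values shown are examples rather than defaults:

```yaml
model_loading_config:
  model_id: meta-llama/Llama-2-7b-chat-hf      # the ID end users query
  model_source: meta-llama/Llama-2-7b-chat-hf  # HuggingFace model ID
llm_engine: VLLMEngine
accelerator_type: A10
tensor_parallelism:
  degree: 1                       # 1 disables tensor parallelism
max_request_context_length: 4096
engine_kwargs:
  trust_remote_code: true         # passed through to vLLM
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 4
generation_config:
  generate_kwargs:
    temperature: 0.7
  stopping_sequences: []
```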
## ModelLoadingConfig
Name | Type | Default | Description |
---|---|---|---|
anytensor_config | Optional[AnytensorConfig] | None | Configuration to use Anytensor for improved model loading speed. Only the model weights will be loaded using Anytensor; the tokenizer and extra files will still be pulled from HuggingFace or the S3/GCS mirror. |
model_id | str | N/A | The ID that should be used by end users to access this model. |
model_source | Union[str, S3MirrorConfig, GCSMirrorConfig] | N/A | Where to obtain the model weights from. Should be a HuggingFace model ID, S3 mirror config, or GCS mirror config. |
tokenizer_source | Optional[str] | None | Where to obtain the tokenizer from. If None, tokenizer is obtained from the model source. Only HuggingFace IDs are supported for now. |
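For example, a ModelLoadingConfig that pulls both the model and the tokenizer from HuggingFace could be sketched as follows (the IDs are illustrative):

```yaml
model_loading_config:
  model_id: my-org/my-model                             # hypothetical ID exposed to end users
  model_source: mistralai/Mistral-7B-Instruct-v0.1      # HuggingFace model ID
  tokenizer_source: mistralai/Mistral-7B-Instruct-v0.1  # optional; defaults to the model source
```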
## S3MirrorConfig
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | Optional[str] | None | The S3 bucket URI to download the model files from. |
extra_files | list[ExtraFiles] | [] | Additional files to download from S3 alongside the model. |
s3_aws_credentials | Optional[S3AWSCredentials] | None | Credentials used to access the S3 bucket. |
s3_sync_args | Optional[list[str]] | None | Additional arguments passed to the command that syncs the files from S3. |
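A model_source given as an S3 mirror config (a field of ModelLoadingConfig above) might look like this sketch; the bucket, sync argument, and extra file are hypothetical:

```yaml
model_source:
  bucket_uri: s3://my-bucket/path/to/model   # hypothetical S3 location of the weights
  s3_sync_args: ["--no-sign-request"]        # example extra argument for the sync command
  extra_files:
    - bucket_uri: s3://my-bucket/path/to/extras/config.json
      destination_path: config.json          # where the file is written locally
```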
## ExtraFiles
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | str | N/A | The S3 or GCS URI of the file to download. |
destination_path | str | N/A | The local path to write the downloaded file to. |
## S3AWSCredentials
Name | Type | Default | Description |
---|---|---|---|
auth_token_env_variable | Optional[str] | None | The name of the environment variable containing the token passed along when requesting credentials. |
create_aws_credentials_url | str | N/A | The URL to query to obtain temporary AWS credentials. |
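For instance, credentials fetched from a hypothetical internal endpoint could be configured like this sketch:

```yaml
s3_aws_credentials:
  create_aws_credentials_url: https://example.com/aws-credentials  # hypothetical endpoint returning temporary credentials
  auth_token_env_variable: MY_CREDENTIALS_TOKEN                    # env variable holding the token sent to that endpoint
```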
## GCSMirrorConfig
Name | Type | Default | Description |
---|---|---|---|
bucket_uri | Optional[str] | None | The GCS bucket URI to download the model files from. |
extra_files | list[ExtraFiles] | [] | Additional files to download from GCS alongside the model. |
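A GCS mirror is configured analogously to the S3 mirror; a sketch with a hypothetical bucket:

```yaml
model_source:
  bucket_uri: gs://my-bucket/path/to/model   # hypothetical GCS location of the weights
  extra_files: []
```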
## AnytensorConfig
Name | Type | Default | Description |
---|---|---|---|
model_path | str | N/A | The storage path Anytensor loads the model weights from. |
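Assuming the weights have already been prepared for Anytensor at some storage path (the path below is hypothetical), the config is a single field:

```yaml
anytensor_config:
  model_path: s3://my-bucket/anytensor/my-model   # hypothetical path to the Anytensor weights
```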
## GenerationConfig
Name | Type | Default | Description |
---|---|---|---|
generate_kwargs | dict[str, Any] | {} | Extra generation kwargs that need to be passed into the sampling stage for the deployment (this includes things like temperature, etc.). |
prompt_format | Optional[Union[PromptFormat, VisionPromptFormat]] | None | Handles chat template formatting and tokenization. If None, prompt formatting is disabled and the model can only be queried in completion mode. |
stopping_sequences | Optional[list[Union[str, int, list[Union[str, int]]]]] | None | Stopping sequences to propagate for inference. By default, we use EOS/UNK tokens at inference. |
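As an example, a GenerationConfig that sets sampling defaults and a custom stop sequence might look like this sketch (values are illustrative):

```yaml
generation_config:
  generate_kwargs:
    temperature: 0.7        # example sampling parameters passed to the engine
    top_p: 0.95
  stopping_sequences: ["<unk>"]
  # prompt_format omitted here; without it the model can only be queried in completion mode
```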
## PromptFormat
Name | Type | Default | Description |
---|---|---|---|
add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
default_system_message | str | "" | The default system message that should be included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
system | str | N/A | The template for the system message. It should include the {instruction} template. |
system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
system_in_user | bool | False | If True, the system message will be included in the user message. |
tool | str | "" | The template for the special role whose content captures the output of the called functions. It should include the {instruction} template. |
tool_calls | str | "" | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
tools_list | str | "" | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
trailing_assistant | str | "" | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. If tools_list_in_user is set to True, it should also include the {tools_list} template. |
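As an illustration, a Llama-2-chat-style prompt_format could be sketched as follows; the template strings are examples, not a canonical format:

```yaml
prompt_format:
  system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
  user: "[INST] {system}{instruction} [/INST]"  # {system} appears here because system_in_user is true
  assistant: " {instruction} </s><s>"
  trailing_assistant: ""
  system_in_user: true
  default_system_message: ""
```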
## VisionPromptFormat
Name | Type | Default | Description |
---|---|---|---|
add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages; the content of those messages is reformatted with this template. It should include the {instruction} template, and if tool_calls is not empty, it should also include the {tool_calls} template. |
bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
default_system_message | str | "" | The default system message that should be included in the prompt if no system message is provided in the input list of messages. Defaults to an empty string. |
strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
system | str | N/A | The template for the system message. It should include the {instruction} template. |
system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
system_in_user | bool | False | If True, the system message will be included in the user message. |
tool | str | "" | The template for the special role whose content captures the output of the called functions. It should include the {instruction} template. |
tool_calls | str | "" | The template for how previously called tools should be presented in the assistant message. It should include the {instruction} template. |
tools_list | str | "" | The template for how the list of available tools should be presented in the user message. It should include the {instruction} template. |
tools_list_in_last_user | bool | True | If True, the tools list will be included in the last user message. Otherwise, it will be included in the first user message. |
tools_list_in_user | bool | True | If True, the tools list will be included in the user message. |
trailing_assistant | str | "" | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. If tools_list_in_user is set to True, it should also include the {tools_list} template. |
vision | bool | True | Whether this prompt format accepts image inputs. |
## LLMEngine
Enum Name | Value |
---|---|
VLLM | VLLMEngine |
## InputModality
Enum Name | Value |
---|---|
text | text |
image | image |
## TensorParallelismConfig
Name | Type | Default | Description |
---|---|---|---|
degree | int | 1 | The degree of tensor parallelism. Must be greater than or equal to 1. When set to 1, the model does not use tensor parallelism. |
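For example, setting the degree to 2 shards the model across two GPUs:

```yaml
tensor_parallelism:
  degree: 2   # shard the model weights across 2 GPUs
```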
## JSONModeConfig
Name | Type | Default | Description |
---|---|---|---|
enabled | bool | False | Whether JSON mode should be enabled on this model. |
options | Optional[JSONModeOptions] | None | Extra options to configure JSON mode behavior. |
## JSONModeOptions
Name | Type | Default | Description |
---|---|---|---|
num_processes | int | 32 | The number of background processes for each replica. |
recreate_failed_actors | bool | True | Whether to restart failed JSON mode actors. |
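Putting the two tables above together, JSON mode could be enabled like this sketch (the option values are illustrative):

```yaml
json_mode:
  enabled: true
  options:
    num_processes: 16             # background processes per replica
    recreate_failed_actors: true  # restart failed JSON mode actors
```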
## LoraConfig
Name | Type | Default | Description |
---|---|---|---|
download_timeout_s | Optional[float] | 30.0 | How much time the download subprocess has to download a single LoRA before a timeout. None means no timeout. |
dynamic_lora_loading_path | Optional[str] | None | Cloud storage path where LoRA adapter weights are stored. |
max_download_tries | int | 3 | The maximum number of download retries. |
max_num_adapters_per_replica | int | 16 | The maximum number of adapters that can be loaded on each replica. |
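A LoraConfig sketch with a hypothetical adapter path:

```yaml
lora_config:
  dynamic_lora_loading_path: s3://my-bucket/lora-adapters  # hypothetical cloud storage path
  max_num_adapters_per_replica: 8
  download_timeout_s: 60.0
  max_download_tries: 3
```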
## DeploymentConfig
Name | Type | Default | Description |
---|---|---|---|
autoscaling_config | Optional[AutoscalingConfig] | See model defaults | Configuration for autoscaling the number of replicas. |
graceful_shutdown_timeout_s | int | 300 | The duration, in seconds, that the controller waits before forcefully killing a replica during shutdown. |
max_concurrent_queries | Optional[int] | None | This field is deprecated. max_ongoing_requests should be used instead. |
max_ongoing_requests | Optional[int] | None | Sets the maximum number of queries in flight that are sent to a single replica. |
## AutoscalingConfig
Name | Type | Default | Description |
---|---|---|---|
downscale_delay_s | float | 300.0 | How long to wait before scaling down replicas, in seconds. |
initial_replicas | int | 1 | The number of replicas that are started initially for the deployment. |
look_back_period_s | float | 30.0 | Time window to average over for metrics, in seconds. |
max_replicas | int | 100 | The maximum number of replicas for the deployment. |
metrics_interval_s | float | 10.0 | How often to scrape for metrics in seconds. |
min_replicas | int | 1 | The minimum number of replicas for the deployment. |
target_num_ongoing_requests_per_replica | Optional[int] | None | This field is deprecated. If it is set, target_ongoing_requests will be set to the same value. If neither field is set, _DEFAULT_TARGET_ONGOING_REQUESTS is used. |
target_ongoing_requests | Optional[int] | None | The target number of ongoing requests per replica that the autoscaler tries to maintain. |
upscale_delay_s | float | 10.0 | How long to wait before scaling up replicas, in seconds. |
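Combining DeploymentConfig and AutoscalingConfig, a deployment sketch might look like the following (all values are illustrative):

```yaml
deployment_config:
  max_ongoing_requests: 64        # cap on in-flight requests per replica
  graceful_shutdown_timeout_s: 300
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_ongoing_requests: 16   # the autoscaler tries to maintain this many ongoing requests per replica
    upscale_delay_s: 10.0
    downscale_delay_s: 300.0
```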