Bring Your Own Models

RayLLM supports all models supported by vLLM. You can either bring a model from HuggingFace or artifact storage like S3, GCS. This doc page covers two important elements of configuration needed while adding custom models:

Model source
Prompt format

Model source

We generally recommend running the starter script rayllm gen-config in the Deploy LLM template, and following the flow to add a custom model. Here are the sample configurations if you are constructing your own YAML files. By default, if there is no source provided RayLLM will try to download the model based on model_id from HuggingFace.

HuggingFace
AWS S3
Google Cloud Storage

model_loading_config:
  model_id: meta-llama/Meta-Llama-3-70B-Instruct # model query name in ChatCompletion API
  model_source: meta-llama/Meta-Llama-3-70B-Instruct # HuggingFace model

model_loading_config:
  model_id: meta-llama/Meta-Llama-3-70B-Instruct
  model_source:
    bucket_uri: s3://…
    s3_sync_args: ["--region", "us-west-2"]

model_loading_config:
  model_id: meta-llama/Meta-Llama-3-70B-Instruct
  model_source:
    bucket_uri: gs://…

Expected files - Model and tokenizer files (config.json, tokenizer_config.json, .bin/.safetensors files), where the model weights must be stored in the safetensors format.

Note: this doc page covers adding custom base models. For serving LoRA adapters, see the multi-lora guide.

Prompt Format

A prompt format is used to convert a chat completions API input into a prompt to feed into the LLM engine. The format is a dictionary where the keys refers to one of the chat actors and the values are a string template for converting the chat messages to a formatted string. Each message in the API input is formatted into a string and these strings are concatenated together to form the final prompt.

The string template should include the {instruction} keyword, which will be replaced with message content from the ChatCompletions API.

Example prompt formats

Llama 3 Instruct models
Mistral-based model

prompt_format:
  system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
  assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
  trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
  user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
  system_in_user: false
  bos: "<|begin_of_text|>"
  default_system_message: ""
stopping_sequences: ["<|end_of_text|>", "<|eot_id|>"]

prompt_format:
  system: "{instruction} + "
  assistant: "{instruction}</s> "
  trailing_assistant: ""
  user: "[INST] {system}{instruction} [/INST]"
  system_in_user: true
  default_system_message: "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity."
stopping_sequences: []

For example, if a user sends the following message for Llama3:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "What about Germany?"
    }
  ]
}

The generated prompt that is sent to the LLM engine will be:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.<|eot_id|><|start_header_id|>user<|end_header_id|>

What about Germany?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Schema

bos: The string that should be prepended to the text before sending it to the model for completion.
system: The system message template. It should include {instruction} template. The instruction will be replaced with the content of messages with role as system in the ChatCompletions API.
assistant: The assistant message template. It should include {instruction} template. The instruction will be replaced with the content of messages with role as assistant in the ChatCompletions API.
trailing_assistant: The special characters that will be added to the end of the prompt before sending it to the LLM for generation. This often includes special characters that put the LLM into assistant mode (granted that model has been trained to support such special keywords). For example Llama-3 Instruct requires <|start_header_id|>assistant<|end_header_id|> at the end of the prompt.
user: The user message template. It should include {instruction} template. The instruction will be replaced with the content of messages with role as user in the ChatCompletions API. If system_in_user is set to True, it should also include {system} template which implies that the system formatted message will be included in this user formatted prompt.

In addition, there some configurations to control the prompt formatting behavior:

default_system_message: The default system message. This system message is used by default if one is not provided in the ChatCompletions API.
system_in_user: Whether the system prompt should be included in the user prompt. If true, the user template should include {system}.
add_system_tags_even_if_message_is_empty: If True, the system message tags will be included in the prompt even if the content of the system message in the ChatCompletions API and default system message are empty.
strip_whitespace: Whether to automatically strip whitespace from left and right of the content for the messages provided in the ChatCompletions API.

Validating Prompt Format

RayLLM comes with a utility CLI, so that you can validate the input prompt to the LLM with given chat messages. In order to use this utility, create the model config YAML file and a JSONL file that contains one chat transcript. The output would print a formatted prompt.

CLI
message.jsonl
Output

rayllm format-prompt --model meta-llama--Meta-Llama-3_1-8B-Instruct.yaml --message-file message.jsonl

{"role": "system", "content": "You are a helpful assistant."}
{"role": "user", "content": "What is the capital of France?"}
{"role": "assistant", "content": "The capital of France is Paris."}
{"role": "user", "content": "What about Germany?"}

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe capital of France is Paris.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat about Germany?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

Prompt format for Non-Instruct Models

Here's an example of how to setup the prompt format for Llama-3 models that are not instruction tuned:

meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Meta-Llama-3.1-8B

prompt_format:
  system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
  assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
  trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n"
  user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
  system_in_user: false
  bos: "<|begin_of_text|>"
  default_system_message: ""
stopping_sequences: ["<|end_of_text|>", "<|eot_id|>"]

prompt_format:
  bos: "<|begin_of_text|>"
  system: "{instruction} "
  assistant: "{instruction} "
  user: "{instruction} "
  system_in_user: false
  default_system_message: ""
stopping_sequences: ["<|end_of_text|>", "<|eot_id|>"]

Model source​

Prompt Format​

Example prompt formats​

Schema​

Validating Prompt Format​