Preparing your dataset

This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.

Example YAML

Specify the training and validation paths in the train_path and valid_path entries of the config file, as shown in the example YAML below. The validation path is optional.

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- Change this path to the path of your training data.
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- Change this path to the path of your validation data. This setting is optional.
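
Before launching a run, it can help to sanity check that every line of your data parses as JSON and carries a "messages" entry in the format described in the Prompt formatting section below. The following is a minimal sketch, assuming you have local copies of the files; the filenames are placeholders:

import json

def check_jsonl(path: str) -> None:
    """Confirm every line is valid JSON with a non-empty "messages" list."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            example = json.loads(line)  # raises if the line isn't valid JSON
            assert example.get("messages"), f"line {i} of {path} has no 'messages' entry"

# Placeholder local filenames; point these at your own files.
check_jsonl("train.jsonl")
check_jsonl("test.jsonl")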

Configuring data access

LLMForge supports loading data from remote storage, like S3 and GCS, and local storage.

For datasets configured for public access, simply add the relevant training and validation file URIs to your training YAML.

tip

If you anticipate using the same dataset files for multiple runs across workspace sessions, upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat it as an accessible remote path.
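
One possible way to do this from a workspace is to copy the files with pyarrow, which handles both s3:// and gs:// URIs; the aws or gsutil CLIs work just as well. A sketch, with a placeholder local file and directory layout:

import os
from pyarrow import fs

# In Anyscale workspaces, ANYSCALE_ARTIFACT_STORAGE holds a bucket URI (s3:// or gs://).
artifact_storage = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

# Placeholder local file and remote layout; adjust to your own naming scheme.
local_path = "train.jsonl"
remote_uri = f"{artifact_storage}/my-finetune-data/train.jsonl"

# pyarrow infers the right filesystem (S3 or GCS) from the URI scheme.
fs.copy_files(local_path, remote_uri)

print(f"Use this as train_path in your config: {remote_uri}")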

Prompt formatting

How prompt formatting works in LLMForge

LLMForge requires formatting training and validation data in the OpenAI messages format. Each example has a "messages" entry consisting of a conversation with "system," "user," and "assistant" roles. For example:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"}
  ]
}
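
To make this concrete, here is a short sketch that writes a couple of made-up conversations in this format to a JSONL file, one JSON object per line; replace the examples and filename with your own:

import json

# Made-up examples; replace with your own conversations.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "What's the value of 1+1?"},
            {"role": "assistant", "content": "The value is 2"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "What's the value of 2+2?"},
            {"role": "assistant", "content": "The value is 4"},
        ]
    },
]

# JSONL: one JSON object per line, no surrounding list.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")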

For each role, depending on the model, LLMForge adds certain tokens as headers or footers, along with a beginning-of-sequence token at the start of the conversation and an end-of-sequence token at the end of each assistant response. This templating is a crucial preprocessing step that converts the conversation format into a plain-text input, which LLMForge then tokenizes and feeds into the model. For Llama-3-8B, LLMForge formats the above example as follows:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the value of 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe value is 2<|eot_id|>
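
If you want to sanity check how a given model renders a conversation, one option outside of LLMForge is the Hugging Face tokenizer's apply_chat_template method, which applies the chat template that ships with the model. The output should closely match the string above, though it isn't guaranteed to be byte-for-byte identical to LLMForge's formatting. A sketch, assuming you have access to the gated Llama-3 repository:

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]

# Requires access to the gated Llama-3 repo on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Render the conversation as plain text instead of token IDs.
print(tokenizer.apply_chat_template(messages, tokenize=False))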

You can specify the prompt format in the YAML as a part of the generation_config for the model. LLMForge uses the same format in its inference code:

generation_config:
  prompt_format:
    system:
    user:
    assistant:
    trailing_assistant: # inference-only
    bos: # optional
    system_in_user: # optional
    default_system_message: # optional

Native models in the list of supported models have default generation config parameters, so you don't need to specify generation_config when you simply want to fine-tune a model like meta-llama/Meta-Llama-3-8B-Instruct directly.

Examples

For meta-llama/Meta-Llama-3-8B, use the following prompt format:

generation_config:
  prompt_format:
    system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
    user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
    assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
    trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n" # inference-only
    bos: "<|begin_of_text|>"
    system_in_user: False
    default_system_message: ""
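
To see how these fields combine, here is a rough sketch (not LLMForge's actual implementation) that applies the templates with plain Python string formatting and reproduces the formatted Llama-3 example shown earlier; it ignores trailing_assistant and default_system_message for brevity:

prompt_format = {
    "system": "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "user": "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "bos": "<|begin_of_text|>",
}

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]

# Concatenate bos + each templated role, mirroring the formatted example above.
text = prompt_format["bos"] + "".join(
    prompt_format[m["role"]].format(instruction=m["content"]) for m in messages
)
print(text)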

For mistralai/Mistral-7B, use the following prompt format:

generation_config:
  prompt_format:
    system: "{instruction} + "
    user: "[INST] {system}{instruction} [/INST]"
    assistant: " {instruction}</s>"
    trailing_assistant: "" # inference-only
    bos: "<s>"
    system_in_user: True
    default_system_message: ""
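
The system_in_user: True flag changes how the pieces fit together: instead of emitting the system message as its own segment, the formatter substitutes it into the {system} slot of the user template. A rough sketch of that behavior, again not LLMForge's actual code:

prompt_format = {
    "system": "{instruction} + ",
    "user": "[INST] {system}{instruction} [/INST]",
    "assistant": " {instruction}</s>",
    "bos": "<s>",
}

# The rendered system text only appears inside the first user turn.
system = prompt_format["system"].format(instruction="You are a helpful assistant")
user = prompt_format["user"].format(system=system, instruction="What's the value of 1+1?")
assistant = prompt_format["assistant"].format(instruction="The value is 2")

print(prompt_format["bos"] + user + assistant)
# -> <s>[INST] You are a helpful assistant + What's the value of 1+1? [/INST] The value is 2</s>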

You can find more information on each entry in the config API reference. Each system, user, and assistant entry must contain the format specifier {instruction}, which is replaced with the content of the corresponding entry in messages.

Customizing data preprocessing and the prompt format

Different LLMs use different prompt formats, or chat templates, during training. To determine the appropriate way to format your data and choose the config parameters, work backwards from the text input the model expects. Llama Guard 2, for example, trains on the following format:

<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]

This format is significantly different from Llama-3's prompt format. Fortunately, the OpenAI conversation format is flexible enough to adapt to almost any input. In a case like Llama Guard 2, it's beneficial to take control of the templating yourself and format your inputs beforehand, in the data preprocessing stage. For example, suppose your data looks like the following:

{
  "messages": [
    {"role": "system",
     "content": "<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

"},
    {"role": "user", "content": "Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"},
    {"role": "assistant", "content": "{expected_response}"}
  ]
}

Note: every entry in the messages list needs non-empty content, and LLMForge expects at least one user message and one assistant message.
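
Here is a sketch of the kind of preprocessing script that could produce examples in this shape. The category text, conversation fields, and helper names are placeholders, and splitting the template between the system and user messages this way is one choice among several:

import json

# Placeholder category list; substitute your actual safety policy text.
UNSAFE_CATEGORIES = "O1: Violence.\nO2: Criminal Planning."

SYSTEM_TEMPLATE = (
    "<|begin_of_text|>[INST] Task: Check if there is unsafe content in "
    "'{role}' messages in conversations according our safety policy with the below categories.\n\n"
    "<BEGIN UNSAFE CONTENT CATEGORIES>\n{categories}\n<END UNSAFE CONTENT CATEGORIES>\n\n"
    "<BEGIN CONVERSATION>\n\nUser: {user_message}\n\nAgent: {agent_message}\n\n<END CONVERSATION>\n\n"
)

USER_TEMPLATE = (
    "Provide your safety assessment for ONLY THE LAST {role} in the above conversation:\n"
    "- First line must read 'safe' or 'unsafe'.\n"
    "- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"
)

def to_example(user_message: str, agent_message: str, label: str) -> dict:
    """Pack one pre-templated Llama Guard-style example into the OpenAI messages format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_TEMPLATE.format(
                role="Agent", categories=UNSAFE_CATEGORIES,
                user_message=user_message, agent_message=agent_message)},
            {"role": "user", "content": USER_TEMPLATE.format(role="Agent")},
            {"role": "assistant", "content": label},
        ]
    }

# Hypothetical raw data: (user turn, agent turn, expected label).
raw = [("How do I bake bread?", "Mix flour, water, yeast and salt, then bake.", "safe")]

with open("train.jsonl", "w") as f:
    for user_message, agent_message, label in raw:
        f.write(json.dumps(to_example(user_message, agent_message, label)) + "\n")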

Because the data already contains the full template, LLMForge's prompt formatter only needs to concatenate the content of the different roles verbatim. Thus, the generation config may look like the following:

generation_config:
  prompt_format:
    system: "{instruction}"
    user: "{instruction}"
    assistant: "{instruction}<|end_of_text|>"
    trailing_assistant: ""
    bos: "" # optional, empty string by default

For the above example, the instruction (format specifier) that you pass to the system template is almost the entire prompt, mainly the problem context. The instruction you pass to the user template contains the specific instructions for the LLM, and the instruction you pass to the assistant template is the expected response: safe or unsafe. Note that this is only one of many possible prompt_format choices; the data preprocessing changes accordingly.

Inference time behavior

After customizing the prompt format during fine-tuning, make sure that you are using the same format at inference. You can use the inference template to deploy your fine-tuned model and specify the same prompt format parameters under the generation entry in the YAML.
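
For example, if you deploy the fine-tuned model behind an OpenAI-compatible endpoint, a query can reuse exactly the same messages structure as the training data. A minimal sketch, with placeholder endpoint details and model ID:

from openai import OpenAI

# Placeholder endpoint details for your deployed fine-tuned model.
client = OpenAI(base_url="https://<your-service-url>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<your-fine-tuned-model-id>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What's the value of 1+1?"},
    ],
)
print(response.choices[0].message.content)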