Preparing your dataset
This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.
Example YAML
You specify the training and validation paths with the train_path and valid_path entries in the config file, as shown in the example YAML below. The validation path is optional.
model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- Change this path to the path of your training data.
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- Change this path to the path of your validation data. This setting is optional.
Configuring data access
LLMForge supports loading data from remote storage, like S3 and GCS, and local storage.
- Public remote storage
- Private remote storage
- Local storage
For datasets configured for public access, simply add the relevant training and validation file URI in your training YAML. LLMForge supports loading data stored on S3 and GCS.
With private storage, you have two options:
Option 1: Configure permissions directly in your cloud account
The most convenient option is to grant your Anyscale workspace read access to the specific bucket. See Access to private cloud storage for more details.
Option 2: Sync data into default cloud storage provided by Anyscale
Another option is to sync your data into the default cloud storage that Anyscale provides and then continue with fine-tuning. Anyscale configures this storage bucket so that your workspace already has read access to it. For example, for private data on AWS S3, first configure your workspace to access the data: export the relevant environment variables, such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, directly into your current terminal session. Next, move this data into the default object storage bucket, $ANYSCALE_ARTIFACT_STORAGE, that Anyscale provides. This option persists across runs and workspace restarts, as opposed to downloading the files into your workspace, which you need to repeat for every run or workspace restart.
- First, download the data into your workspace:
aws s3 sync s3://<bucket_name>/<path_to_data_dir>/ myfiles/
- The default object storage bucket that Anyscale configures for the workspace uses Anyscale-managed credentials internally. You need to reset your own credentials so that they don't interfere with the Anyscale-managed access setup. For example, if your Anyscale hosted cloud is on AWS, keeping the AWS credentials for your private bucket exported means that aws can no longer access the default object storage bucket, $ANYSCALE_ARTIFACT_STORAGE. Reset the credentials by setting the relevant environment variables to the empty string.
- Next, upload your data to $ANYSCALE_ARTIFACT_STORAGE with the relevant AWS S3 or GCS CLI, depending on your Anyscale cloud. For example:
GCP:
gcloud storage cp -r myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
AWS:
aws s3 sync myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
- Finally, update the training and validation paths in the training config YAML.
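If it helps, the following sketch prints the exact URIs to paste into train_path and valid_path. It's only a convenience snippet: it assumes the uploaded directory is myfiles/ and that it contains train.jsonl and val.jsonl, so adjust the names for your data.
import os

# $ANYSCALE_ARTIFACT_STORAGE holds the bucket URI, for example an s3:// or gs:// path.
artifact_storage = os.environ["ANYSCALE_ARTIFACT_STORAGE"]
print(f"train_path: {artifact_storage}/myfiles/train.jsonl")
print(f"valid_path: {artifact_storage}/myfiles/val.jsonl")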
For local files you have two options:
- Upload to remote storage and follow the instructions above. This option is the most reliable for large datasets.
- Upload directly to your Anyscale workspace: This option is the simplest for small files. Use the UI in the VS Code window: right-click a folder and click Upload to upload your training files. You need to place this data in the shared cluster storage, /mnt/cluster_storage, so that it's accessible by all the worker nodes. See Storage for more information on workspace storage. For example, if you uploaded a folder myfiles with the following structure:
myfiles/
├── train.jsonl
└── val.jsonl
You can now do:
mv myfiles /mnt/cluster_storage
Next, update the training config YAML to point to the right training and validation files.
train_path: /mnt/cluster_storage/myfiles/train.jsonl
valid_path: /mnt/cluster_storage/myfiles/val.jsonl
If you anticipate using the same dataset files for multiple runs across workspace sessions, upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat it as an accessible remote path.
Prompt formatting
How prompt formatting works in llmforge
LLMForge requires formatting training and validation data in the OpenAI messages format. Each example has a "messages" entry consisting of a conversation with "system," "user," and "assistant" roles. For example:
{
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "What's the value of 1+1?"},
{"role": "assistant", "content": "The value is 2"}
]
}
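Each line of the .jsonl training or validation file is one such JSON object. As a minimal sketch, assuming you build the records yourself in Python, you could write them out with the standard library; the example list and file name are placeholders for your own data:
import json

# Each element becomes one line in the JSONL file, with a single "messages" key.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "What's the value of 1+1?"},
            {"role": "assistant", "content": "The value is 2"},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")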
For each role, depending on the model, LLMForge adds certain tokens as headers or footers along with a beginning of sequence token at the start of the conversation and an end of sequence token at the end of each assistant response. This templating and formatting is a crucial preprocessing step in converting the conversation format into a plain text input, which LLMForge later tokenizes and feeds into the model. For Llama-3-8B, LLMForge formats the above example as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the value of 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe value is 2<|eot_id|>
You can specify the prompt format in the YAML as a part of the generation_config for the model. LLMForge uses the same format in its inference code:
generation_config:
prompt_format:
system:
user:
assistant:
trailing_assistant: # inference-only
bos: # optional
system_in_user: # optional
default_system_message: # optional
Native models in the list of supported models have default generation config parameters, so you don't need to specify generation_config when you just want to fine-tune a model like meta-llama/Meta-Llama-3-8B-Instruct directly.
Examples
For meta-llama/Meta-Llama-3-8B, use the following prompt format:
generation_config:
prompt_format:
system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
assistant: """<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n" # inference-only
bos: "<|begin_of_text|>"
system_in_user: False
default_system_message: ""
For mistralai/Mistral-7B, use the following prompt format:
generation_config:
prompt_format:
system: "{instruction} + "
user: "[INST] {system}{instruction} [/INST]"
assistant: " {instruction}</s>"
trailing_assistant: "" # inference-only
bos: "<s>"
system_in_user: True
default_system_message: ""
You can find more information on each entry in the config API reference. Each system, user, and assistant entry must contain the format specifier {instruction}, which LLMForge replaces with the content of the corresponding entry in messages.
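To make the mechanics concrete, here's a simplified sketch of how a prompt_format could turn a messages record into the flat training text. This isn't LLMForge's actual implementation: render_prompt and LLAMA3_FORMAT are illustrative names, and trailing_assistant is omitted because it only applies at inference.
def render_prompt(messages, fmt):
    system, rendered = "", fmt.get("bos", "")
    for msg in messages:
        if msg["role"] == "system":
            system = fmt["system"].format(instruction=msg["content"])
            if not fmt.get("system_in_user", False):
                rendered += system
        elif msg["role"] == "user":
            if fmt.get("system_in_user", False):
                # Fold the already formatted system text into the user template's {system} slot.
                rendered += fmt["user"].format(system=system, instruction=msg["content"])
            else:
                rendered += fmt["user"].format(instruction=msg["content"])
        elif msg["role"] == "assistant":
            rendered += fmt["assistant"].format(instruction=msg["content"])
    return rendered

LLAMA3_FORMAT = {
    "system": "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "user": "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "bos": "<|begin_of_text|>",
    "system_in_user": False,
}

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]
# Prints the same flat Llama-3 text shown earlier in this section.
print(render_prompt(messages, LLAMA3_FORMAT))
With the Mistral format above, the same sketch folds the formatted system message into the user turn because system_in_user is True.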
Customizing data preprocessing and the prompt format
Different LLMs use different prompt formats, or chat templates, during training. To determine the appropriate way to format the data and choose the config parameters, work backwards from the text input to the model. For Llama Guard 2, for example, training uses the following format:
<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {{ user_message_1 }}
Agent: {{ model_answer_1 }}
<END CONVERSATION>
Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
This format is significantly different from Llama-3's prompt format. Fortunately, the OpenAI conversation format is flexible enough to adapt to almost any input. In a case like Llama Guard 2, it's beneficial to take control of the templating yourself and format your inputs beforehand, in the data preprocessing stage. For example, consider data that looks like the following:
{
"messages": [
{"role": "system",
"content": "<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {{ user_message_1 }}
Agent: {{ model_answer_1 }}
<END CONVERSATION>
",
},
{"role": "user", "content": "Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"},
{"role": "assistant", "content": "{expected_response}"}
]
}
Note: all the entries in the messages list need to have non-empty content, and at a minimum LLMForge expects one user and one assistant message.
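One way to produce such records is to fill the template yourself during preprocessing. The sketch below is only an illustration: the helper name and parameters are hypothetical, the role placeholder is fixed to 'Agent' for brevity, the whitespace is approximate, and you supply your own categories text and conversation.
def build_llama_guard_example(unsafe_categories, user_message, model_answer, expected_response):
    # Fill the Llama Guard 2 template shown above.
    system_content = (
        "<|begin_of_text|>[INST] Task: Check if there is unsafe content in 'Agent' messages "
        "in conversations according our safety policy with the below categories.\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{unsafe_categories}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n"
        "<BEGIN CONVERSATION>\n"
        f"User: {user_message}\n"
        f"Agent: {model_answer}\n"
        "<END CONVERSATION>\n"
    )
    user_content = (
        "Provide your safety assessment for ONLY THE LAST Agent in the above conversation:\n"
        "- First line must read 'safe' or 'unsafe'.\n"
        "- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"
    )
    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            # The expected response, for example "safe" or "unsafe" plus the violated categories.
            {"role": "assistant", "content": expected_response},
        ]
    }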
Because the data is already fully templated, LLMForge only needs the prompt formatter to concatenate the content of the different roles verbatim. Thus, the generation config may look like the following:
generation_config:
prompt_format:
system: "{instruction}"
user: "{instruction}"
assistant: "{instruction}<|end_of_text|>"
trailing_assistant: ""
bos: "" # optional, empty string by default
For the above example, the instruction, or format specifier, that you pass to the system template is almost the entire prompt, which is mainly the problem context. The instruction you pass to the user template contains the specific instructions for the LLM, and the instruction you pass to the assistant template is the expected response, which is safe or unsafe. Also note that this format is only one of many possible prompt_format configurations; the data preprocessing changes accordingly.
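As a sanity check, you can reuse the render_prompt sketch from earlier with this verbatim format to inspect the flat text the model trains on:
VERBATIM_FORMAT = {
    "system": "{instruction}",
    "user": "{instruction}",
    "assistant": "{instruction}<|end_of_text|>",
    "bos": "",
    "system_in_user": False,
}
# render_prompt(example["messages"], VERBATIM_FORMAT) returns the system, user, and
# assistant contents concatenated in order, with <|end_of_text|> appended at the end.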
Inference time behavior
After customizing the prompt format during fine-tuning, make sure that you use the same format at inference. You can use the inference template to deploy your fine-tuned model and specify the same prompt format parameters under the generation entry in the YAML.