Preparing your dataset

This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.

Example YAML

Specify the training and validation paths in the train_path and valid_path entries of the config file, as shown in the example YAML below. The validation path is optional.

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- Change this path to the path of your training data.
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- Change this path to the path of your validation data. This setting is optional.
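
Before launching a run, it can help to sanity check that every line of your data parses as JSON and carries a "messages" entry in the format described in the Prompt formatting section below. The following is a minimal sketch, assuming you have local copies of the files; the filenames are placeholders:

import json

def check_jsonl(path: str) -> None:
    """Confirm every line is valid JSON with a non-empty "messages" list."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            example = json.loads(line)  # raises if the line isn't valid JSON
            assert example.get("messages"), f"line {i} of {path} has no 'messages' entry"

# Placeholder local filenames; point these at your own files.
check_jsonl("train.jsonl")
check_jsonl("test.jsonl")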

Configuring data access

LLMForge supports loading data from remote storage, like S3 and GCS, and local storage.

For datasets configured for public access, simply add the relevant training and validation file URIs to your training YAML.

tip

If you anticipate using the same dataset files for multiple runs across workspace sessions, upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat it as an accessible remote path.
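
One possible way to do this from a workspace is to copy the files with pyarrow, which handles both s3:// and gs:// URIs; the aws or gsutil CLIs work just as well. A sketch, with a placeholder local file and directory layout:

import os
from pyarrow import fs

# In Anyscale workspaces, ANYSCALE_ARTIFACT_STORAGE holds a bucket URI (s3:// or gs://).
artifact_storage = os.environ["ANYSCALE_ARTIFACT_STORAGE"]

# Placeholder local file and remote layout; adjust to your own naming scheme.
local_path = "train.jsonl"
remote_uri = f"{artifact_storage}/my-finetune-data/train.jsonl"

# pyarrow infers the right filesystem (S3 or GCS) from the URI scheme.
fs.copy_files(local_path, remote_uri)

print(f"Use this as train_path in your config: {remote_uri}")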

Prompt formatting

How prompt formatting works in LLMForge

LLMForge requires formatting training and validation data in the OpenAI messages format. Each example has a "messages" entry consisting of a conversation with "system," "user," and "assistant" roles. For example:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"}
  ]
}
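
To make this concrete, here is a short sketch that writes a couple of made-up conversations in this format to a JSONL file, one JSON object per line; replace the examples and filename with your own:

import json

# Made-up examples; replace with your own conversations.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "What's the value of 1+1?"},
            {"role": "assistant", "content": "The value is 2"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "What's the value of 2+2?"},
            {"role": "assistant", "content": "The value is 4"},
        ]
    },
]

# JSONL: one JSON object per line, no surrounding list.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")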

For each role, depending on the model, LLMForge adds certain tokens as headers or footers, along with a beginning-of-sequence token at the start of the conversation and an end-of-sequence token at the end of each assistant response. This templating is a crucial preprocessing step that converts the conversation format into a plain-text input, which LLMForge then tokenizes and feeds into the model. For Llama-3-8B, LLMForge formats the above example as follows:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the value of 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe value is 2<|eot_id|>
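
If you want to sanity check how a given model renders a conversation, one option outside of LLMForge is the Hugging Face tokenizer's apply_chat_template method, which applies the chat template that ships with the model. The output should closely match the string above, though it isn't guaranteed to be byte-for-byte identical to LLMForge's formatting. A sketch, assuming you have access to the gated Llama-3 repository:

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]

# Requires access to the gated Llama-3 repo on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Render the conversation as plain text instead of token IDs.
print(tokenizer.apply_chat_template(messages, tokenize=False))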

You can specify the prompt format in the YAML as a part of the generation_config for the model. LLMForge uses the same format in its inference code:

generation_config:
  prompt_format:
    system:
    user:
    assistant:
    trailing_assistant: # inference-only
    bos: # optional
    system_in_user: # optional
    default_system_message: # optional

Native models in the list of supported models have default generation config parameters, so you don't need to specify generation_config when you simply want to fine-tune a model like meta-llama/Meta-Llama-3-8B-Instruct directly.

Examples

For meta-llama/Meta-Llama-3-8B, use the following prompt format:

generation_config:
  prompt_format:
    system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
    user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
    assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
    trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n" # inference-only
    bos: "<|begin_of_text|>"
    system_in_user: False
    default_system_message: ""
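
To see how these fields combine, here is a rough sketch (not LLMForge's actual implementation) that applies the templates with plain Python string formatting and reproduces the formatted Llama-3 example shown earlier; it ignores trailing_assistant and default_system_message for brevity:

prompt_format = {
    "system": "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "user": "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "bos": "<|begin_of_text|>",
}

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]

# Concatenate bos + each templated role, mirroring the formatted example above.
text = prompt_format["bos"] + "".join(
    prompt_format[m["role"]].format(instruction=m["content"]) for m in messages
)
print(text)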

For mistralai/Mistral-7B, use the following prompt format:

generation_config:
  prompt_format:
    system: "{instruction} + "
    user: "[INST] {system}{instruction} [/INST]"
    assistant: " {instruction}</s>"
    trailing_assistant: "" # inference-only
    bos: "<s>"
    system_in_user: True
    default_system_message: ""
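
The system_in_user: True flag changes how the pieces fit together: instead of emitting the system message as its own segment, the formatter substitutes it into the {system} slot of the user template. A rough sketch of that behavior, again not LLMForge's actual code:

prompt_format = {
    "system": "{instruction} + ",
    "user": "[INST] {system}{instruction} [/INST]",
    "assistant": " {instruction}</s>",
    "bos": "<s>",
}

# The rendered system text only appears inside the first user turn.
system = prompt_format["system"].format(instruction="You are a helpful assistant")
user = prompt_format["user"].format(system=system, instruction="What's the value of 1+1?")
assistant = prompt_format["assistant"].format(instruction="The value is 2")

print(prompt_format["bos"] + user + assistant)
# -> <s>[INST] You are a helpful assistant + What's the value of 1+1? [/INST] The value is 2</s>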

You can find more information on each entry in the config API reference. Each system, user, and assistant entry must contain the format specifier {instruction}, which is replaced with the content of the corresponding entry in messages.

Customizing data preprocessing and the prompt format

Different LLMs use different prompt formats, or chat templates, during training. To determine the appropriate way to format your data and choose the config parameters, work backwards from the text input the model expects. Llama Guard 2, for example, trains on the following format:

<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]

This format is significantly different from Llama-3's prompt format. Fortunately, the OpenAI conversation format is flexible enough to adapt to almost any input. In a case like Llama Guard 2, it's beneficial to take control of the templating yourself and format your inputs beforehand, in the data preprocessing stage. For example, suppose your data looks like the following:

{
  "messages": [
    {"role": "system",
     "content": "<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

"},
    {"role": "user", "content": "Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"},
    {"role": "assistant", "content": "{expected_response}"}
  ]
}

Note: every entry in the messages list needs non-empty content, and LLMForge expects at least one user message and one assistant message.
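
Here is a sketch of the kind of preprocessing script that could produce examples in this shape. The category text, conversation fields, and helper names are placeholders, and splitting the template between the system and user messages this way is one choice among several:

import json

# Placeholder category list; substitute your actual safety policy text.
UNSAFE_CATEGORIES = "O1: Violence.\nO2: Criminal Planning."

SYSTEM_TEMPLATE = (
    "<|begin_of_text|>[INST] Task: Check if there is unsafe content in "
    "'{role}' messages in conversations according our safety policy with the below categories.\n\n"
    "<BEGIN UNSAFE CONTENT CATEGORIES>\n{categories}\n<END UNSAFE CONTENT CATEGORIES>\n\n"
    "<BEGIN CONVERSATION>\n\nUser: {user_message}\n\nAgent: {agent_message}\n\n<END CONVERSATION>\n\n"
)

USER_TEMPLATE = (
    "Provide your safety assessment for ONLY THE LAST {role} in the above conversation:\n"
    "- First line must read 'safe' or 'unsafe'.\n"
    "- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"
)

def to_example(user_message: str, agent_message: str, label: str) -> dict:
    """Pack one pre-templated Llama Guard-style example into the OpenAI messages format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_TEMPLATE.format(
                role="Agent", categories=UNSAFE_CATEGORIES,
                user_message=user_message, agent_message=agent_message)},
            {"role": "user", "content": USER_TEMPLATE.format(role="Agent")},
            {"role": "assistant", "content": label},
        ]
    }

# Hypothetical raw data: (user turn, agent turn, expected label).
raw = [("How do I bake bread?", "Mix flour, water, yeast and salt, then bake.", "safe")]

with open("train.jsonl", "w") as f:
    for user_message, agent_message, label in raw:
        f.write(json.dumps(to_example(user_message, agent_message, label)) + "\n")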

Because the data already contains the full template, LLMForge's prompt formatter only needs to concatenate the content of the different roles verbatim. Thus, the generation config may look like the following:

generation_config:
  prompt_format:
    system: "{instruction}"
    user: "{instruction}"
    assistant: "{instruction}<|end_of_text|>"
    trailing_assistant: ""
    bos: "" # optional, empty string by default

For the above example, the instruction (format specifier) that you pass to the system template is almost the entire prompt, mainly the problem context. The instruction you pass to the user template contains the specific instructions for the LLM, and the instruction you pass to the assistant template is the expected response: safe or unsafe. Note that this is only one of many possible prompt_format choices; the data preprocessing changes accordingly.

Inference time behavior

After customizing the prompt format during fine-tuning, make sure that you are using the same format at inference. You can use the inference template to deploy your fine-tuned model and specify the same prompt format parameters under the generation entry in the YAML.
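
For example, if you deploy the fine-tuned model behind an OpenAI-compatible endpoint, a query can reuse exactly the same messages structure as the training data. A minimal sketch, with placeholder endpoint details and model ID:

from openai import OpenAI

# Placeholder endpoint details for your deployed fine-tuned model.
client = OpenAI(base_url="https://<your-service-url>/v1", api_key="<your-api-key>")

response = client.chat.completions.create(
    model="<your-fine-tuned-model-id>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What's the value of 1+1?"},
    ],
)
print(response.choices[0].message.content)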