Bring your own data

This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.

Example YAML

Specify the training and validation file paths in the train_path and valid_path entries of the config file, as shown in the example YAML below. The validation file path is optional.

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- change this to the path to your training data
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- change this to the path to your validation data. This is optional
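Before pointing train_path at your own data, you need the files in JSONL format, where each line is one JSON object. As a sketch, the snippet below writes a small training file in an OpenAI-style chat "messages" schema; the exact schema LLMForge expects for your task is an assumption here and should be checked against the data-format documentation for your fine-tuning mode.

```python
import json

# Hypothetical example: write training examples as JSONL, one JSON object
# per line. The chat-style "messages" schema used here is an assumption
# for illustration; verify the expected schema for your fine-tuning task.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Natalia sold clips to 48 friends. How many clips did she sell if each friend bought 2?"},
            {"role": "assistant", "content": "48 * 2 = 96, so she sold 96 clips."},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        # json.dumps produces a single line per example, as JSONL requires.
        f.write(json.dumps(example) + "\n")
```

You can then upload the resulting file to remote storage and reference its URI in train_path.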

Set up data access

LLMForge supports loading data from remote storage (S3, GCS) and local storage.

For datasets configured for public access, simply add the relevant training and validation file URIs to your training YAML.

tip

If you anticipate using the same dataset files across multiple runs or workspace sessions, upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat it as an accessible remote path.
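To make that concrete, here is a minimal sketch of building dataset URIs under $ANYSCALE_ARTIFACT_STORAGE. The artifact_uri helper and its fallback bucket are hypothetical, introduced only for illustration; inside an Anyscale workspace the environment variable points at your cloud's artifact bucket (S3 or GCS), and you would upload the files there with your cloud's CLI before referencing the URIs in train_path and valid_path.

```python
import os

# Hypothetical helper: build a remote URI under $ANYSCALE_ARTIFACT_STORAGE
# so the same uploaded dataset files can be reused across runs and
# workspace sessions.
def artifact_uri(relative_path: str) -> str:
    # Fallback bucket is a placeholder for running outside a workspace.
    base = os.environ.get("ANYSCALE_ARTIFACT_STORAGE", "s3://my-bucket/artifacts")
    return f"{base.rstrip('/')}/{relative_path.lstrip('/')}"

# These URIs would go into the train_path / valid_path config entries.
train_path = artifact_uri("gsm8k/train.jsonl")
valid_path = artifact_uri("gsm8k/test.jsonl")
print(train_path)
print(valid_path)
```

Depending on your cloud, the upload itself would use the matching CLI (for example, aws s3 cp for S3 buckets or gsutil cp for GCS buckets).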