Version: Canary 🐤

Bring your own data

This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.

Example YAML

Specify the training and validation file paths in the train_path and valid_path entries of the config file, as shown in the example YAML below. The validation file path is optional.

model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- change this to the path to your training data
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- change this to the path to your validation data. This is optional

Setup Data Access

LLMForge supports loading data from remote storage (S3, GCS) and local storage.

Public Remote Storage

For datasets configured for public access, add the relevant training and validation file URIs to your training YAML. We support loading data stored on S3 and GCS.

Private Remote Storage

With private storage, you have two options:

Option 1: Configure permissions directly in your cloud account

The most convenient option is to grant your Anyscale workspace read permissions for the specific bucket. Our guide describes how to do so.

Option 2: Sync data into default cloud storage provided by Anyscale

Another option is to sync (copy) your data into Anyscale-provided storage and then continue with fine-tuning. Anyscale configures this storage bucket so that your workspace can read from it. Consider private data on AWS S3. To access it from your workspace, export the relevant environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, etc.) directly in your current terminal session. You then move the data into the default object storage bucket provided by Anyscale ($ANYSCALE_ARTIFACT_STORAGE). Unlike downloading the files only into your workspace, this means you don't have to repeat the process across runs or workspace restarts.
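
As a minimal sketch of the export step for the AWS case, the variables can be set as follows. The values in angle brackets are placeholders rather than real credentials, and AWS_SESSION_TOKEN is only needed when you use temporary credentials.

```shell
# Placeholders: replace each <...> value with your own credentials.
export AWS_ACCESS_KEY_ID="<your_access_key_id>"
export AWS_SECRET_ACCESS_KEY="<your_secret_access_key>"
# Only needed when using temporary (session) credentials.
export AWS_SESSION_TOKEN="<your_session_token>"
```

These exports apply only to the current terminal session, so run the download command in the same session.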

First, download the data into your workspace:

aws s3 sync s3://<bucket_name>/<path_to_data_dir>/ myfiles/

The default object storage bucket configured for you in your workspace uses Anyscale-managed credentials internally. We recommend resetting the credentials you provided so that they don't interfere with the Anyscale-managed access setup. For example, if your Anyscale hosted cloud is on AWS, then exporting the AWS credentials for your private bucket means that the aws CLI can no longer access the default object storage bucket ($ANYSCALE_ARTIFACT_STORAGE). To reset your credentials, set the relevant environment variables to the empty string.
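
For the AWS case described above, the reset could look like the following sketch. It assumes the three variables named earlier are the only ones you exported; reset any others you set in the same way.

```shell
# Reset the credentials exported earlier by setting each variable to the empty string.
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_SESSION_TOKEN=""
```
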
Next, you can upload your data to $ANYSCALE_ARTIFACT_STORAGE with the relevant CLI (AWS S3 / GCS depending on your Anyscale Cloud). For example:

GCP:
```
gcloud storage cp -r myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
```
AWS:
```
aws s3 sync myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
```
Finally, you can update the training and validation paths in your training config YAML.
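
To avoid typing the storage prefix by hand, you could print the two config entries from the shell and paste them into your YAML. This is only a sketch: print_data_paths is a hypothetical helper, and the file names train.jsonl and val.jsonl inside myfiles/ are assumptions about what you uploaded.

```shell
# Hypothetical helper: prints config entries for files under a given storage prefix.
# Assumes the uploaded folder is myfiles/ and contains train.jsonl and val.jsonl.
print_data_paths() {
  base="${1%/}"  # drop a trailing slash, if any
  printf 'train_path: %s/myfiles/train.jsonl\n' "$base"
  printf 'valid_path: %s/myfiles/val.jsonl\n' "$base"
}

# Falls back to a visible placeholder when $ANYSCALE_ARTIFACT_STORAGE is not set.
print_data_paths "${ANYSCALE_ARTIFACT_STORAGE:-<ANYSCALE_ARTIFACT_STORAGE>}"
```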

Local Storage

For local files, you have two options:

Upload to remote storage and follow the instructions above. This is the more reliable option for large datasets.
Upload directly to your Anyscale workspace. This is the simplest option for small files. You can use the UI in your VS Code window (right click -> upload files/folder) to upload your training files. The data must be placed in the shared cluster storage /mnt/cluster_storage so that all worker nodes can access it. (For more on workspace storage, see our guide.) For example, suppose you uploaded a folder myfiles with the following structure:

myfiles/  
├── train.jsonl
└── val.jsonl

Then move the folder into the shared cluster storage:

mv myfiles /mnt/cluster_storage

Next, update your training config YAML to point to the right training and validation files.

train_path: /mnt/cluster_storage/myfiles/train.jsonl
valid_path: /mnt/cluster_storage/myfiles/val.jsonl
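
Before starting a run, it can help to confirm that the files landed where the config expects them. The sketch below assumes the folder layout from the example above; check_dataset_dir is a hypothetical helper, not part of LLMForge. It reports the line count of each file, which for JSONL equals the number of records as long as there are no blank lines.

```shell
# Hypothetical helper: checks that both dataset files exist and are non-empty.
check_dataset_dir() {
  dir="$1"
  status=0
  for f in train.jsonl val.jsonl; do
    if [ -s "$dir/$f" ]; then
      echo "$f: $(wc -l < "$dir/$f") lines"
    else
      echo "$f: missing or empty" >&2
      status=1
    fi
  done
  return "$status"
}

check_dataset_dir /mnt/cluster_storage/myfiles || echo "Fix the dataset files before starting a run."
```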

Tip

If you expect to reuse the same dataset files across multiple runs or workspace sessions, upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat that location as a remote path.
