Bring your own data
This guide focuses on how you can bring your own data to fine-tune your model using LLMForge.
Example YAML
Specify the training and validation file paths in the train_path and valid_path entries of the config file, as shown in the example YAML below. The validation file path is optional.
model_id: meta-llama/Meta-Llama-3-8B-Instruct
train_path: s3://air-example-data/gsm8k/train.jsonl # <-- change this to the path to your training data
valid_path: s3://air-example-data/gsm8k/test.jsonl # <-- change this to the path to your validation data. This is optional
Set up data access
LLMForge supports loading data from remote storage (S3, GCS) and local storage.
Public Remote Storage
For datasets configured for public access, you simply need to add the relevant training and validation file URIs to your training YAML. We support loading data stored on S3 and GCS.
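For instance, a config pointing at a public GCS dataset would look like the following (the bucket and file names here are hypothetical placeholders):
train_path: gs://<public_bucket_name>/<path_to_data_dir>/train.jsonl
valid_path: gs://<public_bucket_name>/<path_to_data_dir>/val.jsonl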
Private Remote Storage
With private storage, you have two options:
Option 1: Configure permissions directly in your cloud account
The most convenient option is to grant your Anyscale workspace read permissions for the specific bucket. You can follow our guide to do so here.
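Once the permissions are in place, a quick sanity check is to list your bucket from the workspace terminal. For example, with an S3 bucket (with <bucket_name> and <path_to_data_dir> as placeholders for your bucket and data prefix):
aws s3 ls s3://<bucket_name>/<path_to_data_dir>/
If the listing succeeds, you can reference the s3:// paths directly in your training YAML.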
Option 2: Sync data into the default cloud storage provided by Anyscale
Another option is to sync (copy) your data into the Anyscale-provided storage and then continue with fine-tuning. Anyscale configures this storage bucket so that your workspace already has read access to it. Consider private data on AWS S3. First, configure your workspace to access the data by exporting the relevant environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, etc.) into your current terminal session. Next, move this data into the default object storage bucket provided by Anyscale ($ANYSCALE_ARTIFACT_STORAGE). That way, you don't have to repeat this process across runs and workspace restarts (compared to just downloading the files into your workspace).
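For example, assuming you have credentials for the private bucket at hand, you would export them like this (fill in your own values):
export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_SESSION_TOKEN=<your_session_token>  # only needed for temporary credentials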
- First, download the data into your workspace:
  aws s3 sync s3://<bucket_name>/<path_to_data_dir>/ myfiles/
- The default object storage bucket configured for you in your workspace uses Anyscale-managed credentials internally. It's recommended to reset the credentials you provided so that they don't interfere with the Anyscale-managed access setup. For example, if your Anyscale hosted cloud is on AWS, then keeping your private bucket's credentials in the environment means that the aws CLI can no longer access the default object storage bucket ($ANYSCALE_ARTIFACT_STORAGE). Reset your credentials by simply setting the relevant environment variables to the empty string (see the first example after this list).
- Next, upload your data to $ANYSCALE_ARTIFACT_STORAGE with the relevant CLI (AWS S3 or GCS, depending on your Anyscale cloud). For example:
  GCP:
  gcloud storage cp -r myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
  AWS:
  aws s3 sync myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/
- Finally, update the training and validation paths in your training config YAML to point at the uploaded files (see the second example after this list).
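To reset the credentials, clear the same environment variables you exported earlier, for example:
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_SESSION_TOKEN=""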
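The updated config then points at the uploaded copies. This is just a sketch: replace <artifact_storage_uri> with the actual URI that $ANYSCALE_ARTIFACT_STORAGE resolves to (for example, check it with echo $ANYSCALE_ARTIFACT_STORAGE in the workspace terminal), and use your own file names:
train_path: <artifact_storage_uri>/myfiles/train.jsonl
valid_path: <artifact_storage_uri>/myfiles/val.jsonl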
Local Storage
For local files, you have two options:
- Upload to remote storage and follow the instructions above (the more reliable option for large datasets).
- Upload directly to your Anyscale workspace: This is the simplest option for small files. You can use the UI in your VS Code window (simply right click -> upload files/folder) to upload your training files. This data needs to be placed in the shared cluster storage /mnt/cluster_storage so that it's accessible by all the worker nodes. (For more on workspace storage, see our guide here.) For example, let's say you uploaded a folder myfiles with the following structure:
  myfiles/
  ├── train.jsonl
  └── val.jsonl
  You would now do:
  mv myfiles /mnt/cluster_storage
  Next, update your training config YAML to point to the right training and validation files:
  train_path: /mnt/cluster_storage/myfiles/train.jsonl
  valid_path: /mnt/cluster_storage/myfiles/val.jsonl
If you anticipate using the same dataset files for multiple runs or across workspace sessions, you should upload the files to $ANYSCALE_ARTIFACT_STORAGE and treat it as an accessible remote path.
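For example, if your Anyscale cloud is on AWS, you could sync the files from cluster storage into artifact storage (on GCP, use gcloud storage cp -r instead) and then reference the resulting remote path in your config:
aws s3 sync /mnt/cluster_storage/myfiles/ $ANYSCALE_ARTIFACT_STORAGE/myfiles/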