Fine-tuning Configs API

Check your docs version

These docs are for the new Anyscale design. If you started using Anyscale before April 2024, use Version 1.0.0 of the docs. If you're transitioning to Anyscale Preview, see the guide for how to migrate.

This document describes the API of the FinetuningConfig model, which defines the YAML configs used in the fine-tuning example on the Anyscale platform.


FinetuningConfig

The main config model for defining fine-tuning jobs.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| checkpoint_every_n_epochs | Optional[int] | 1 | If provided, the model runs validation and saves a checkpoint after every n epochs. If None, the value is set automatically so that max_num_checkpoints checkpointing events are triggered during training. |
| classifier_config | Optional[ClassifierConfig] | None | Config for the LLM classifier. |
| context_length | Optional[int] | None | The context length to use for training. If not provided, it is calculated automatically from dataset statistics: at the beginning of training, the training and validation datasets are analyzed, and the context length is chosen as the maximum of the 95th percentile of the token count and the context length of the base model. For context length extension, this can be larger than the base model context length. |
| deepspeed | Optional[DeepspeedConfig] | None | The DeepSpeed configuration to use for training. Required if you want to use DeepSpeed for training. |
| embedding_scaling_technique | str | none | The technique to use for context length extension. It can be 'none' or 'linear' (positional interpolation). |
| eval_batch_size_per_device | int | 1 | The batch size per device to use for evaluation. |
| flash_attention_2 | bool | True | Set to True to use flash attention v2 kernels. |
| gradient_accumulation_steps | int | 1 | The number of gradient accumulation steps. Increase this number to raise the effective batch size per device, which can improve convergence. |
| initial_adapter_model_ckpt_path | Optional[str] | None | If provided, and LoRA is enabled, the adapter weights are initialized from this path. Currently supports S3. |
| initial_base_model_ckpt_path | Optional[str] | None | If provided, the base model weights are loaded from this path. Currently supports S3. |
| learning_rate | float | 5e-06 | The learning rate to use for training. |
| lora_config | dict | | Any Hugging Face compatible LoRA configuration is supported, such as different ranks, target modules, etc. If not provided, full-parameter fine-tuning is performed. |
| lr_scheduler_type | str | cosine | The learning rate scheduler type to use for training. It can be 'cosine' or 'linear'. |
| max_num_checkpoints | int | 10 | The maximum number of validation and checkpointing events to trigger. If checkpoint_every_n_epochs is not provided, this also sets the frequency of validation and checkpointing events. |
| min_num_update_steps | int | 100 | The minimum number of update steps to ensure model convergence. Used only when num_epochs is not provided, to calculate the number of epochs accordingly. |
| model_id | str | N/A | The base model ID on the Hugging Face model hub. |
| no_gradient_checkpoint | bool | False | Set to True to disable gradient checkpointing. Gradient checkpointing is enabled by default to save memory, which allows training larger models on longer context lengths at the cost of speed. |
| num_checkpoints_to_keep | Optional[int] | 1 | The number of checkpoints to keep. Keep more than one if you want multiple checkpoints to validate after training. |
| num_data_blocks_per_device | int | 2 | The number of dataset blocks per GPU, which controls data ingestion intensity. Increasing it improves loading speed, but too many blocks can trigger autoscaling and add overhead. Keep the default if unsure. |
| num_devices | int | N/A | The number of GPUs to use for ZeRO data-parallel training. |
| num_epochs | Optional[int] | None | The number of epochs to train the model. If not provided, it is calculated automatically from the dataset size and the minimum number of updates, so that training covers at least min_num_update_steps updates for convergence. |
| num_warmup_steps | int | 10 | The number of warmup steps for the cosine learning rate scheduler. |
| pad_to_multiple_of | Optional[int] | 8 | Input sequences are padded to a multiple of the specified value during training, to leverage Tensor Cores on NVIDIA Volta (or newer) hardware. |
| padding | PaddingStrategy | PaddingStrategy.LONGEST | The padding strategy to use for training. For benchmarking, use the 'max_length' padding strategy so that the model is trained on the longest sequence length and you catch OOMs early. For production training, use the 'longest' padding strategy to avoid wasting compute on padding tokens. |
| preprocess_batch_size | Optional[int] | None | Batch size for the dataset preprocessing step. The default of None uses the entire block as the batch. |
| train_batch_size_per_device | int | 1 | The batch size per device to use for training, without considering gradient accumulation. |
| train_path | str | N/A | The location of the training dataset. It can be a local path or a remote path on S3 or GCS. |
| trainer_resources | dict | | The rank-zero worker resources to use during training, given as a dictionary that maps resource types to amounts for the rank-zero worker. For example, {'memory': 10_000_000_000} schedules rank zero on a machine with at least 10 GB of RAM, e.g. for weight aggregation during checkpointing. The rank-zero worker can have different resource requirements from the other workers. In full-parameter fine-tuning, rank zero usually gets a larger memory requirement to provide more CPU RAM for weight aggregation. See example configs. |
| valid_path | Optional[str] | None | The location of the validation dataset. It can be a local path or a remote path on S3 or GCS. If provided, the model is evaluated on this dataset after each epoch and the best checkpoint is saved according to the lowest perplexity achieved on this dataset. If not provided, checkpoints are saved according to the lowest perplexity achieved on the training dataset. |
| worker_resources | dict | | The rank-non-zero worker resources to use during training, given as a dictionary that maps resource types to amounts for the rank-non-zero workers. For example, {'accelerator_type:A10G': 0.001} means an A10G instance should be used for each worker. The rank-zero worker can have different resource requirements, such as different RAM requirements, from the rank-non-zero workers. Common GPU types include A100-40G, A100-80G, H100, and L4. The availability of these GPUs depends on demand or reservation within your cloud. See example configs for concrete examples. |
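
The interaction between train_batch_size_per_device, num_devices, gradient_accumulation_steps, and min_num_update_steps can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the rounding rule in `estimate_num_epochs`, both helper names, and the example values for model_id and train_path are assumptions made for this sketch.

```python
import math


def effective_batch_size(train_batch_size_per_device: int,
                         num_devices: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Number of sequences consumed per optimizer update across all GPUs."""
    return train_batch_size_per_device * num_devices * gradient_accumulation_steps


def estimate_num_epochs(num_train_examples: int,
                        batch_size: int,
                        min_num_update_steps: int = 100) -> int:
    """Smallest epoch count that reaches min_num_update_steps (assumed rule)."""
    updates_per_epoch = max(1, math.ceil(num_train_examples / batch_size))
    return max(1, math.ceil(min_num_update_steps / updates_per_epoch))


# Hypothetical config values; model_id and train_path are placeholders.
config = {
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "train_path": "s3://my-bucket/train.jsonl",
    "num_devices": 8,
    "train_batch_size_per_device": 4,
    "gradient_accumulation_steps": 2,
    "min_num_update_steps": 100,
}

batch = effective_batch_size(
    config["train_batch_size_per_device"],
    config["num_devices"],
    config["gradient_accumulation_steps"],
)
epochs = estimate_num_epochs(1000, batch, config["min_num_update_steps"])
print(batch, epochs)
```

With 1000 training examples, each epoch yields 16 updates at a batch size of 64, so under this assumed rule 7 epochs are needed to pass 100 updates.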

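The automatic context_length rule described in the table (the maximum of the 95th percentile of token counts and the base model context length) can be sketched as follows. The percentile method is not specified in this reference, so the nearest-rank definition used here is an assumption, and both function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def auto_context_length(token_counts, base_model_context_length):
    """Apply the rule as stated: max(95th percentile of token counts, base context length)."""
    return max(percentile(token_counts, 95), base_model_context_length)


# Token counts per example, pooled over the training and validation datasets.
token_counts = list(range(1, 101))
print(auto_context_length(token_counts, 50))
print(auto_context_length([10, 20, 30], 4096))
```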

DeepspeedConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| config_path | str | N/A | Path to the DeepSpeed configuration file. |
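
As a rough sketch of how config_path might be used, the snippet below writes a tiny DeepSpeed JSON config to a temporary file, references it from a fine-tuning config dictionary, and reads back the ZeRO stage. The file name `zero_3.json` and the helper `read_zero_stage` are hypothetical; `zero_optimization.stage` and `bf16.enabled` are standard DeepSpeed config keys.

```python
import json
import tempfile
from pathlib import Path


def read_zero_stage(config_path: str) -> int:
    """Return the ZeRO stage declared in a DeepSpeed JSON config (0 if absent)."""
    with open(config_path) as f:
        ds_config = json.load(f)
    return ds_config.get("zero_optimization", {}).get("stage", 0)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical DeepSpeed config file created only for this demonstration.
    path = Path(tmp) / "zero_3.json"
    path.write_text(json.dumps({
        "zero_optimization": {"stage": 3},
        "bf16": {"enabled": True},
    }))

    # The deepspeed field of the fine-tuning config points at that file.
    finetune_config = {"deepspeed": {"config_path": str(path)}}
    stage = read_zero_stage(finetune_config["deepspeed"]["config_path"])

print(stage)
```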


| Enum Name | Value |
| --- | --- |


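ClassifierConfig

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| eval_metrics | list[Metric] | [] | List of evaluation metrics to be used for the classifier. |
| label_tokens | list[str] | N/A | List of tokens representing the labels for classification. |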
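
To illustrate what label_tokens represent, the sketch below validates a list of label tokens and scores predictions against expected labels. Everything here is an assumption for illustration: the token strings, both helper names, and the accuracy computation, which is not claimed to correspond to any member of the Metric enum.

```python
def validate_label_tokens(label_tokens):
    """Reject empty or duplicated label token lists."""
    if not label_tokens:
        raise ValueError("label_tokens must contain at least one token")
    if len(set(label_tokens)) != len(label_tokens):
        raise ValueError("label_tokens must be unique")
    return list(label_tokens)


def accuracy(predicted, expected, label_tokens):
    """Fraction of examples whose predicted token is a valid label and matches."""
    allowed = set(label_tokens)
    correct = sum(p == e for p, e in zip(predicted, expected) if p in allowed)
    return correct / len(expected)


# Hypothetical label tokens for a binary classifier.
labels = validate_label_tokens(["[[0]]", "[[1]]"])
predicted = ["[[1]]", "[[0]]", "[[1]]", "maybe"]
expected = ["[[1]]", "[[1]]", "[[1]]", "[[0]]"]
score = accuracy(predicted, expected, labels)
print(score)
```

A prediction outside the label set, such as "maybe" above, counts as incorrect in this sketch.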


| Enum Name | Value |
| --- | --- |