Fine-tuning Configs API

This document describes the API for the FinetuningConfig model, which defines the schema of the config YAMLs used in the fine-tuning example on the Anyscale platform.

FinetuningConfig

The main config model for defining fine-tuning jobs.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| checkpoint_every_n_epochs | Optional[int] | 1 | If provided, the model will run validation and save a checkpoint after every n epochs. If None, we automatically set it such that max_num_checkpoints checkpointing events are triggered during training. |
| classifier_config | Optional[ClassificationConfig] | None | Config for the LLM classifier. |
| context_length | Optional[int] | None | The context length to use for training. If not provided, it will be automatically calculated from the dataset statistics as follows: at the beginning of training we analyze the training and validation datasets, and the context length is chosen to be the maximum of the 95th percentile of the token count and the context length of the base model. For context length extension this can be larger than the base model context length. |
| data_processor_config | Optional[DatasetMapperConfig] | None | Config for the dataset preprocessing. Internally, this is a Ray Data map_batches operation where we launch Ray Actors concurrently to apply the prompt format and tokenize the dataset. By default, the number of Actors (concurrency) is set to the number of blocks in the dataset. Note that the custom_resources field is ignored since data processing happens on CPU. |
| deepspeed | Optional[DeepspeedConfig] | None | The DeepSpeed configuration to use for training. It is a required field if you want to use DeepSpeed for training. |
| eval_batch_size_per_device | int | 1 | The batch size per device to use for evaluation. |
| flash_attention_2 | bool | True | Set to True to use flash attention v2 kernels. |
| generation_config | Optional[Union[dict[str, Any], GenerationConfig]] | None | Custom generation config for the model. The generation config consists of parameters for chat templating as well as special tokens. This is required when the provided model_id is not in our list of supported model IDs. |
| gradient_accumulation_steps | int | 1 | The number of gradient accumulation steps. Use this number to effectively increase the batch size per device to improve convergence. |
| gradient_checkpointing | bool | True | Whether to use gradient checkpointing to save memory. By default, gradient checkpointing is enabled. This setting enables training larger models on longer context lengths by trading off speed. |
| initial_adapter_model_ckpt_path | Optional[Union[str, RemoteStoragePath, HFHubPath, LocalPath]] | None | If provided, and LoRA is enabled, load the initialization of the adapter weights from this path. Supports S3 paths, local paths, or Hugging Face model IDs. |
| initial_base_model_ckpt_path | Optional[Union[str, RemoteStoragePath, HFHubPath, LocalPath]] | None | If provided, load the base model weights from this path. Currently supports S3 and local paths. |
| learning_rate | float | 5e-06 | The learning rate to use for training. |
| liger_kernel | Optional[LigerConfig] | None | Configuration for using Liger Kernels for training. See the liger_kernel repo for a list of supported models. |
| logger | Optional[LoggerConfig] | None | (Preview) Experimental config for a user-specified logging service. Currently MLflow and WandB are supported. Neither is used by default. |
| lora_config | dict | | LLMForge supports any Hugging Face compatible LoRA configuration, such as different ranks, target modules, etc. If not provided, defaults to full-parameter fine-tuning. |
| lr_scheduler_type | str | cosine | The learning rate scheduler type to use for training. It can be 'cosine' or 'linear'. |
| max_num_checkpoints | int | 10 | The maximum number of validation + checkpointing events to trigger. If checkpoint_every_n_epochs is not provided, this also sets the frequency at which validation and checkpointing events run. |
| min_num_update_steps | int | 100 | The minimum number of update steps to ensure model convergence. This is used only when num_epochs is not provided, to calculate the number of epochs accordingly. |
| model_id | str | N/A | The base model ID according to the Hugging Face model hub. |
| no_gradient_checkpoint | Optional[bool] | None | [DEPRECATED] If true, disables gradient checkpointing. Use gradient_checkpointing instead. |
| num_checkpoints_to_keep | Optional[int] | 1 | The number of checkpoints to keep. You can choose to keep more than one checkpoint so that multiple checkpoints are available to validate after training. |
| num_data_blocks_per_device | int | 8 | Number of dataset blocks per GPU. Controls data ingestion intensity: increasing it improves loading speed, but too many blocks can trigger autoscaling and add overhead. Keep the default if unsure. |
| num_devices | int | N/A | The number of GPUs to use for ZeRO data-parallel training. |
| num_epochs | Optional[int] | None | The number of epochs to train the model. If not provided, it will be automatically calculated based on the dataset size and the minimum number of updates, so that at least min_num_update_steps updates are covered for convergence. |
| num_warmup_steps | int | 10 | The number of warmup steps for the cosine learning rate scheduler. |
| pad_to_multiple_of | Optional[int] | 8 | Input sequences are padded to a multiple of the specified value during training, to leverage Tensor Cores on NVIDIA Volta (or newer) hardware. |
| padding | PaddingStrategy | PaddingStrategy.LONGEST | The padding strategy to use for training. When benchmarking, it is recommended to use the 'max_length' padding strategy so the model is trained on the longest sequence length and you can catch OOMs early. For production training, use the 'longest' padding strategy to avoid wasting compute on padding tokens. |
| preference_tuning_config | Optional[PreferenceTuningConfig] | None | Config for preference tuning. |
| prefetch_batches | int | 1 | Number of batches to prefetch per GPU worker. See the ray.data.Dataset.iter_torch_batches documentation for more details. |
| task | Task | Task.NOT_SPECIFIED | Task for model training. We support the following tasks: ['causal_lm', 'instruction_tuning', 'preference_tuning', 'classification']. If not specified, the task defaults to 'causal_lm', unless task-specific configs are provided, such as classifier_config. For example, if classifier_config is provided but task is not, the task is inferred to be 'classification'. |
| torch_compile | Optional[TorchCompileConfig] | None | (Preview) Configuration for using torch.compile. Set to true to use torch.compile for training. Note: LLMForge doesn't guarantee that torch.compile works with all models or with other model optimization settings such as Liger Kernel. |
| train_batch_size_per_device | int | 1 | The batch size per device to use for training, without considering gradient accumulation. |
| train_path | Union[str, ReadConfig] | N/A | The location of the training dataset as well as its read configuration. It can be a local path or a remote path on S3 or GCS. |
| trainer_resources | dict | | The rank-zero worker resources to use during training. It is a dictionary that maps the rank-zero worker to the correct resource type. For example, {'memory': 10_000_000_000} means we want rank-zero to be scheduled on a machine with at least 10G of RAM, e.g. for weight aggregation during checkpointing. The rank-zero worker can have different resource requirements compared to rank-non-zero workers. Usually we set a different memory requirement for rank-0 in full-parameter fine-tuning to provide more CPU RAM for weight aggregation. See example configs. |
| use_cli_to_checkpoint | bool | True | |
| valid_path | Optional[Union[str, ReadConfig]] | None | The location of the validation dataset as well as its read configuration. It can be a local path or a remote path on S3 or GCS. If provided, the model will be evaluated on this dataset after each epoch, and the best checkpoint will be saved according to the lowest perplexity achieved on this dataset. If not provided, checkpoints will be saved according to the lowest perplexity achieved on the training dataset. |
| worker_resources | dict | | The rank-non-zero worker resources to use during training. It is a dictionary that maps the rank-non-zero workers to the correct resource type. For example, {'accelerator_type:A10G': 0.001} means an A10G instance should be used for each worker. The rank-zero worker can have different resource requirements, such as different RAM requirements, compared to rank-non-zero workers. Some common GPU types include A100-40G, A100-80G, H100, and L4. The availability of these GPUs depends on demand or reservation within your cloud. See the example configs for concrete examples. |
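
For orientation, here is a minimal sketch of how some of these fields might appear together in a fine-tuning config YAML. The model ID, dataset paths, LoRA parameters, and resource values below are placeholders, not recommendations.

```yaml
# Illustrative fine-tuning config; model ID, paths, and sizes are placeholders.
model_id: meta-llama/Meta-Llama-3-8B-Instruct   # base model from the Hugging Face hub
train_path: s3://my-bucket/data/train.jsonl     # training dataset (local, S3, or GCS)
valid_path: s3://my-bucket/data/valid.jsonl     # optional validation dataset
num_devices: 4                                  # GPUs for data-parallel training
learning_rate: 1e-5
num_epochs: 3
train_batch_size_per_device: 2
eval_batch_size_per_device: 2
gradient_accumulation_steps: 2
padding: longest
lora_config:                                    # omit for full-parameter fine-tuning
  r: 8
  lora_alpha: 16
  target_modules: ["q_proj", "v_proj"]
deepspeed:
  config_path: deepspeed_configs/zero_3.json    # hypothetical path to a DeepSpeed JSON
worker_resources:
  "accelerator_type:A10G": 0.001                # request A10G GPUs for the workers
```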

RemoteStoragePath

| Name | Type | Default | Description |
|------|------|---------|-------------|
| cli_args | list[str] | [] | A list of flags and their (optional) values to use while accessing the remote URI. A flag and its value can be provided in the same string, space-separated, or provided as different entries in the list. Ex: ['--region us-east-2', '--no-sign-request'], ['--region', 'us-west-2'] |
| storage_type | Optional[RemoteStorageType] | None | Remote storage type. Can be one of the following: ['aws', 'awsv2', 'gcloud']. If not provided, this is automatically inferred from the provided uri. We use AWS CLI v2 by default for S3 paths. |
| uri | str | N/A | Remote URI for the file(s), including the URI scheme. Example: s3://my_bucket/folder |
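
As a hedged illustration, a RemoteStoragePath can back fields such as initial_base_model_ckpt_path; the bucket and region below are placeholders.

```yaml
# Hypothetical remote checkpoint location; bucket and region are placeholders.
initial_base_model_ckpt_path:
  uri: s3://my-bucket/checkpoints/base-model/
  storage_type: awsv2            # optional; inferred from the URI if omitted
  cli_args:
    - "--region us-west-2"       # flags passed to the storage CLI while fetching
```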

RemoteStorageType

| Enum Name | Value |
|-----------|-------|
| AWS | aws |
| AWS_V2 | awsv2 |
| GCP | gcloud |

HFHubPath

| Name | Type | Default | Description |
|------|------|---------|-------------|
| repo_id | str | N/A | Repo ID in the HuggingFace Hub |
| revision | Optional[str] | None | Commit hash or revision for the repo in the HuggingFace Hub |

LocalPath

| Name | Type | Default | Description |
|------|------|---------|-------------|
| path | str | N/A | Local path to the file(s) |
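
For comparison, the same kind of checkpoint field could instead point at the Hugging Face Hub or at local storage; the repo ID and path below are made up for illustration.

```yaml
# Load LoRA adapter initialization from a (hypothetical) Hugging Face Hub repo.
initial_adapter_model_ckpt_path:
  repo_id: my-org/my-lora-adapter
  revision: main                  # optional commit hash or revision

# Alternatively, a LocalPath pointing at shared storage (placeholder path):
# initial_adapter_model_ckpt_path:
#   path: /mnt/shared_storage/checkpoints/adapter/
```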

ReadConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| data_format | str | json | A string that specifies the expected data format to read. Possible choices: 'json', 'csv', 'parquet', etc. Since we use Ray Data's read mechanism, it is expected that Ray Data provides a read_{data_format} API accordingly. Please refer to https://docs.ray.io/en/latest/data/api/input_output.html for more information. |
| params | dict[str, Any] | | The read kwargs that are accepted by the corresponding Ray Data read_{data_format} API. Please refer to https://docs.ray.io/en/latest/data/api/input_output.html for more information. |
| path | str | N/A | The data file or directory path to read from. |
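
As a sketch, train_path can be given as a ReadConfig instead of a plain path string; the bucket and the columns kwarg below are illustrative assumptions for a Parquet dataset.

```yaml
# train_path expressed as a ReadConfig; this maps to ray.data.read_parquet under the hood.
train_path:
  path: s3://my-bucket/data/train/     # placeholder bucket
  data_format: parquet                 # Ray Data provides a read_parquet API
  params:
    columns: ["messages"]              # example read kwarg; adjust to your schema
```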

LoggerConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| provider | Literal[wandb, mlflow] | N/A | The logging provider to use |
| provider_config | Union[WandbLoggerConfig, MLflowLoggerConfig] | N/A | The logger provider configuration |
| rank_zero_only | bool | True | If True, the logger will only be used by the rank 0 process |

WandbLoggerConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| group | Optional[str] | None | The group name for the Weights and Biases run |
| id | Optional[str] | None | The Weights and Biases trial id to resume from |
| name | Optional[str] | None | The trial name for the Weights and Biases run. By default will be set to the generated model tag. |
| project | str | llmforge | The project name for the Weights and Biases run. Default is 'llmforge'. |
| tags | Optional[list[str]] | None | The tags to be associated with the Weights and Biases run |
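
A hedged example of wiring the (preview) logger field to Weights and Biases; the project, run name, group, and tags below are placeholders.

```yaml
# Experimental logging config; names are placeholders.
logger:
  provider: wandb
  rank_zero_only: true
  provider_config:
    project: llmforge
    name: llama-3-8b-finetune-run      # defaults to the generated model tag if omitted
    group: ablation-experiments
    tags: ["lora", "context-4k"]
```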

MLflowLoggerConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| create_experiment_if_not_exists | bool | True | If True, the experiment will be created if it does not already exist |
| experiment_id | Optional[str] | None | The id of an already existing MLflow experiment to use for logging. Takes precedence over experiment_name. |
| experiment_name | Optional[str] | None | The name of the MLflow experiment to use for logging. If the experiment does not exist and create_experiment_if_not_exists is True, a new experiment will be created. |
| run_name | Optional[str] | None | The name of the MLflow run. By default will be set to the generated model tag. |
| tags | Optional[dict] | None | The tags to be associated with the MLflow run |
| tracking_uri | str | N/A | The tracking URI for the MLflow server |
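
The MLflow counterpart might look like the following sketch; the tracking URI, experiment name, and run name are placeholders.

```yaml
# Experimental MLflow logging config; URI and names are placeholders.
logger:
  provider: mlflow
  provider_config:
    tracking_uri: http://mlflow.internal.example.com:5000
    experiment_name: llmforge-finetuning
    create_experiment_if_not_exists: true
    run_name: llama-3-8b-lora          # defaults to the generated model tag if omitted
```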

LigerConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| enabled | bool | False | If true, use Liger Kernels for training. See the liger_kernel repo for a list of supported models. |
| kwargs | dict[str, Any] | | Keyword arguments to pass to the corresponding apply_liger_kernel_to_* function. See https://github.com/linkedin/Liger-Kernel/tree/main?tab=readme-ov-file#patching for model specific arguments. |
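
A sketch of enabling Liger Kernels; the specific kwargs are model dependent (see the Liger-Kernel patching docs linked above), and the ones below are only an assumption for a Llama-style model.

```yaml
# Liger Kernel config; kwargs assume a Llama-style model and may differ per model.
liger_kernel:
  enabled: true
  kwargs:
    rope: true
    rms_norm: true
    swiglu: true
```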

TorchCompileConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| backend | str | inductor | The backend to use for torch_compile. Default is inductor. |
| enabled | bool | False | If true, LLMForge compiles the model using torch_compile. |
| kwargs | dict[str, Any] | | Additional kwargs to pass to torch_compile. For example, mode, dynamic, or options. See full options at https://pytorch.org/docs/main/generated/torch.compile.html |
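
A minimal sketch of the (preview) torch.compile config; the mode value is just one of the options torch.compile accepts.

```yaml
# Preview feature; not guaranteed to work with every model or with Liger Kernel.
torch_compile:
  enabled: true
  backend: inductor
  kwargs:
    mode: max-autotune     # see the torch.compile docs for other modes and options
```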

DeepspeedConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| config_path | str | N/A | Path to the DeepSpeed configuration file |
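
For example, the deepspeed field simply points at a DeepSpeed JSON file; the path below is a placeholder assumed to exist in your workspace.

```yaml
# DeepSpeed config; the JSON path is a placeholder.
deepspeed:
  config_path: deepspeed_configs/zero_3_offload_optim+param.json
```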

PaddingStrategy

| Enum Name | Value |
|-----------|-------|
| LONGEST | longest |
| MAX_LENGTH | max_length |
| DO_NOT_PAD | do_not_pad |

Task

| Enum Name | Value |
|-----------|-------|
| CAUSAL_LM | causal_lm |
| INSTRUCTION_TUNING | instruction_tuning |
| PREFERENCE_TUNING | preference_tuning |
| CLASSIFICATION | classification |
| NOT_SPECIFIED | not_specified |

ClassificationConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| eval_metrics | list[Metric] | [] | List of evaluation metrics to be used for the classifier. |
| label_tokens | list[str] | N/A | List of tokens representing the labels for classification. |
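
As an illustration, a classification fine-tune combines the task field with classifier_config; the label tokens below are arbitrary placeholders for your label vocabulary.

```yaml
# LLM-classifier setup; label tokens are placeholders.
task: classification
classifier_config:
  label_tokens:
    - "[[1]]"
    - "[[2]]"
    - "[[3]]"
  eval_metrics:
    - accuracy
    - f1
```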

Metric

| Enum Name | Value |
|-----------|-------|
| ACCURACY | accuracy |
| PRAUC | prauc |
| F1 | f1 |

PreferenceTuningConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| beta | float | 0.01 | Beta hyperparameter for DPO |
| logprob_processor_scaling_config | DatasetMapperConfig | custom_resources={}, concurrency=1, batch_size=None | Config for the reference log probability calculation needed for the preference tuning loss. Internally, this is a Ray Data map_batches operation where we launch Ray Actors concurrently to compute logits (and then the log probabilities) from the reference model. Currently, each worker runs on 1 GPU, with the accelerator type configurable. |
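
A hedged sketch of a DPO-style preference tuning setup; the beta value and scaling settings below are illustrative, not recommendations.

```yaml
# Preference tuning config; values are illustrative.
task: preference_tuning
preference_tuning_config:
  beta: 0.05
  logprob_processor_scaling_config:    # reference-model log-prob computation
    concurrency: 4                     # 4 workers, each running on 1 GPU
    batch_size: 2
    custom_resources:
      "accelerator_type:A10G": 0.001   # choose the accelerator for these workers
```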

DatasetMapperConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| batch_size | Optional[int] | None | Batch size per worker for the map operation. If None, an entire block of data is used as a batch. |
| concurrency | int | 1 | Number of Ray workers to use concurrently for the map operation. |
| custom_resources | dict[str, Any] | | Custom resources (per worker) to use. For running on GPUs, please specify the accelerator type; see https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#accelerator-types for more details. |
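
For instance, the same model scales the tokenization step through data_processor_config; note that custom_resources is ignored there because preprocessing runs on CPU. The numbers below are illustrative.

```yaml
# Dataset preprocessing scaling; numbers are illustrative.
data_processor_config:
  concurrency: 8        # number of Ray Actors applying prompt formatting + tokenization
  batch_size: 1024      # rows per map_batches call; None uses one block per batch
```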

GenerationConfig

| Name | Type | Default | Description |
|------|------|---------|-------------|
| prompt_format | PromptFormat | N/A | Handles chat template formatting and tokenization |
| stopping_sequences | Optional[list[str]] | None | Stopping sequences to propagate for inference. By default, we use EOS/UNK tokens at inference. |

PromptFormat

| Name | Type | Default | Description |
|------|------|---------|-------------|
| add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
| assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages. The content of those messages is reformatted with this template. It should include the {instruction} template. |
| bos | str | | The string that should be prepended to the text before sending it to the model for completion. Defaults to an empty string. |
| default_system_message | str | | The default system message that should be included in the prompt if no system message is provided in the input list of messages. If not specified, this is an empty string. |
| strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
| system | str | N/A | The template for the system message. It should include the {instruction} template. |
| system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
| system_in_user | bool | False | If True, the system message will be included in the user message. |
| trailing_assistant | str | | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
| user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. |
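
To tie GenerationConfig and PromptFormat together, here is a hedged sketch of a custom generation_config for an unsupported model_id; the tag strings below are invented and must be replaced with your model's actual chat template and special tokens.

```yaml
# Illustrative chat template; the special tokens below are placeholders.
generation_config:
  prompt_format:
    system: "<|system|>\n{instruction}\n"
    user: "<|user|>\n{instruction}\n"
    assistant: "<|assistant|>\n{instruction}\n"
    trailing_assistant: "<|assistant|>\n"   # inference only
    bos: "<s>"
    default_system_message: ""
    system_in_user: false
  stopping_sequences: ["<|user|>"]
```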