Fine-tuning Configs API
This document describes the API for the FinetuningConfig model, which defines the schema of the config YAMLs used in the fine-tuning example on the Anyscale platform.
FinetuningConfig
The main config model for defining fine-tuning jobs.
Name | Type | Default | Description |
---|---|---|---|
checkpoint_and_evaluation_frequency | Optional[CheckpointAndEvalFrequency] | None | Checkpoint and Evaluation frequency. For example, if the strategy is epochs and the value is 1, then the current model weights are saved and, if a validation dataset is provided, evaluated every epoch. |
checkpoint_every_n_epochs | Optional[int] | None | [DEPRECATED] If provided, the model will run validation and save a checkpoint after every n epochs. By default, we save after every epoch. |
classifier_config | Optional[ClassificationConfig] | None | Config for the LLM-classifier |
context_length | Optional[int] | None | The context length to use for training. If not provided, it will be automatically calculated based on the dataset statistics as follows: At the beginning of training we analyze the training and validation dataset. The context length is chosen to be the maximum of the 95th percentile of the token count and the context length of the base model. For context length extension this can be larger than the base model context length. |
data_processor_config | Optional[DatasetMapperConfig] | None | Config for dataset preprocessing. Internally, this is a Ray Data map_batches operation where Ray Actors are launched concurrently to apply the prompt format and tokenize the dataset. By default, the number of Actors (the concurrency) is set to the number of blocks in the dataset. Note that the custom_resources field is ignored, since data processing happens on CPU. |
deepspeed | Optional[DeepspeedConfig] | None | The DeepSpeed configuration to use for training. This field is required if you want to use DeepSpeed for training. |
eval_batch_size_per_device | int | 1 | The batch size per device to use for evaluation. |
flash_attention_2 | bool | True | Set to True to use flash attention v2 kernels. |
generation_config | Optional[Union[dict[str, Any], GenerationConfig]] | None | Custom generation config for the model. The generation config consists of parameters for chat templating as well as special tokens. This is required when the provided model_id is not in our list of supported model ids. |
gradient_accumulation_steps | int | 1 | The number of gradient accumulation steps. Use this number to effectively increase the batch size per device to improve convergence. |
gradient_checkpointing | bool | True | Whether to use gradient checkpointing to save memory. By default, gradient checkpointing is enabled. This setting enables training larger models on larger context lengths by trading off speed. |
initial_adapter_model_ckpt_path | Optional[Union[str, RemoteStoragePath, HFHubPath, LocalPath]] | None | If provided, and LoRA is enabled, load the initialization of the adapter weights from this path. Supports S3 paths, local paths, or Hugging Face model IDs. |
initial_base_model_ckpt_path | Optional[Union[str, RemoteStoragePath, HFHubPath, LocalPath]] | None | If provided, load the base model weights from this path. Currently supports S3 and local paths. |
learning_rate | float | 5e-06 | The learning rate to use for training. |
liger_kernel | Optional[LigerConfig] | None | Configuration for using Liger Kernels for training. See the liger_kernel repo for a list of supported models. |
logger | Optional[LoggerConfig] | None | (Preview) Experimental config for a user-specified logging service. Currently, MLflow and WandB are supported. Neither is used by default. |
lora_config | dict | {} | LLMForge supports any Hugging Face compatible LoRA configuration, such as different ranks, target modules, etc. If not provided, defaults to full-parameter fine-tuning. |
lr_scheduler_type | str | cosine | The learning rate scheduler type to use for training. It can be 'cosine' or 'linear'. |
max_num_checkpoints | Optional[int] | None | [DEPRECATED] The maximum number validation + checkpointing events to trigger if checkpointing frequency is not provided. Use checkpoint_and_evaluation_frequency instead. |
min_num_update_steps | int | 100 | The minimum number of update steps to ensure model convergence. This is used only when num_epochs is not provided to calculate the number of epochs accordingly. |
model_id | str | N/A | The base model ID, as listed on the Hugging Face Model Hub. |
no_gradient_checkpoint | Optional[bool] | None | [DEPRECATED] If true, disables gradient checkpointing. Use gradient_checkpointing instead. |
num_checkpoints_to_keep | Optional[int] | None | The number of checkpoints to keep. Use this to restrict how many checkpoints remain available for validation after training. By default, all checkpoints are kept. |
num_data_blocks_per_device | int | 8 | Number of dataset blocks per GPU. Controls data-ingestion intensity: increasing it can improve loading speed, but too many blocks can trigger autoscaling and add overhead. Tune it for performance without going overboard; the default is recommended if unsure. |
num_devices | int | N/A | The number of GPUs to use for ZeRO data-parallel training. |
num_epochs | Optional[int] | None | The number of epochs to train the model. If not provided, it will be automatically calculated based on the dataset size and the minimum number of updates. We want to make sure we cover at least min_num_update_steps updates for convergence. |
num_warmup_steps | int | 10 | The number of warmup steps for cosine learning rate scheduler. |
optimizer_config | OptimizerConfig | See model defaults | The optimizer configuration to use for training. |
pad_to_multiple_of | Optional[int] | 8 | Input sequences are padded to a multiple of the specified value during training, to leverage Tensor Cores on NVIDIA Volta (or newer) hardware. |
padding | PaddingStrategy | PaddingStrategy.LONGEST | The padding strategy to use for training. When doing benchmarking it is recommended to use 'max_length' padding strategy to make sure the model is trained on the longest sequence length, so that you can catch OOMs early. If you are doing production training, you can use 'longest' padding strategy to not waste compute on padding tokens. |
preference_tuning_config | Optional[PreferenceTuningConfig] | None | Config for preference tuning |
prefetch_batches | int | 1 | Number of batches to prefetch per GPU worker. See ray.data.Dataset.iter_torch_batches documentation for more details |
task | Task | Task.NOT_SPECIFIED | Task for model training. We support the following tasks: ['causal_lm', 'instruction_tuning', 'preference_tuning', 'classification', 'vision_language']. If not specified, the task defaults to 'causal_lm', unless task-specific configs are provided, such as classifier_config. For example, if classifier_config is provided but task is not, the task is inferred to be 'classification'. |
torch_compile | Optional[TorchCompileConfig] | None | (Preview) Configuration for using torch.compile. Set enabled to true to compile the model for training. Note: LLMForge doesn't guarantee that torch.compile works with all models or with other model optimization settings such as Liger Kernel. |
train_batch_size_per_device | int | 1 | The batch size per device to use for training, without considering gradient accumulation. |
train_path | Union[str, ReadConfig] | N/A | The location of the training dataset as well as its read configuration. It can be a local path or a remote path on s3 or gcs. |
trainer_resources | dict | {} | The rank-zero worker resources to use during training. It is a dictionary that maps the rank-zero worker to the correct resource type. For example, {'memory': 10_000_000_000} means rank zero should be scheduled on a machine with at least 10 GB of RAM, e.g. for weight aggregation during checkpointing. The rank-zero worker can have different resource requirements than rank-non-zero workers; typically we set a higher memory requirement for rank 0 in full-parameter fine-tuning to provide more CPU RAM for weight aggregation. See example configs. |
use_cli_to_checkpoint | bool | True | |
valid_path | Optional[Union[str, ReadConfig]] | None | The location of the validation dataset as well as its read configuration. It can be a local path or a remote path on s3 or gcs. If provided, the model is evaluated on this dataset after each epoch and the best checkpoint is chosen according to the lowest perplexity achieved on this dataset. If not provided, checkpoints are ranked by the lowest perplexity achieved on the training dataset. |
vision_language_config | Optional[VisionLanguageConfig] | None | Config for vision language instruction tuning |
worker_resources | dict | {} | The rank-non-zero worker resources to use during training. It is a dictionary that maps the rank-non-zero workers to the correct resource type. For example, {'accelerator_type:A10G': 0.001} means an A10G instance should be used for each worker. The rank-zero worker can have different resource requirements, such as different RAM requirements, compared to rank-non-zero workers. Some common GPU types include A100-40G, A100-80G, H100, and L4. The availability of these GPUs depends on demand or reservation within your cloud. See example configs for concrete examples. |
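To show how these fields fit together, here is a minimal sketch of a fine-tuning config YAML. It assumes the top-level YAML keys mirror the FinetuningConfig field names; the model ID, dataset paths, LoRA settings, and resource values are illustrative placeholders rather than recommendations.

```yaml
# Minimal illustrative fine-tuning config; all values are placeholders.
model_id: meta-llama/Meta-Llama-3-8B-Instruct   # base model on the Hugging Face Hub
train_path: s3://my-bucket/data/train.jsonl     # local, s3, or gcs path
valid_path: s3://my-bucket/data/valid.jsonl     # optional; enables per-epoch evaluation
num_devices: 4                                  # GPUs used for data-parallel training
learning_rate: 5.0e-6
num_epochs: 3
train_batch_size_per_device: 1
eval_batch_size_per_device: 1
gradient_accumulation_steps: 2                  # effective batch size = 4 GPUs x 1 x 2
lora_config:                                    # any Hugging Face-compatible LoRA config;
  r: 8                                          # omit this block for full-parameter tuning
  target_modules: ["q_proj", "v_proj"]
worker_resources:
  "accelerator_type:A10G": 0.001                # schedule rank-non-zero workers on A10G GPUs
```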
RemoteStoragePath
Name | Type | Default | Description |
---|---|---|---|
cli_args | list[str] | [] | A list of flags and their (optional) values to use while accessing remote URI. A flag and its value can be provided in the same string, space-separated or provided as different entries in the list. Ex: ['--region us-east-2', '--no-sign-request'], ['--region', 'us-west-2'] |
storage_type | Optional[RemoteStorageType] | None | Remote storage type. Can be one of the following: ['aws', 'gcloud']. If not provided, this is automatically inferred from the provided uri. We use aws cli v2 by default for S3 paths. |
uri | str | N/A | Remote URI for the file(s), including the URI scheme. Example: s3://my_bucket/folder |
RemoteStorageType
Enum Name | Value |
---|---|
AWS | aws |
GCP | gcloud |
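For fields such as initial_base_model_ckpt_path that accept a RemoteStoragePath, the structured form spells out the URI, storage type, and CLI flags. A sketch, with a placeholder bucket and region:

```yaml
initial_base_model_ckpt_path:
  uri: s3://my-bucket/checkpoints/base-model/   # remote URI including the scheme
  storage_type: aws                             # optional; inferred from the URI if omitted
  cli_args: ["--region", "us-east-2"]           # extra flags passed when accessing the URI
```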
HFHubPath
Name | Type | Default | Description |
---|---|---|---|
repo_id | str | N/A | Repo ID in the HuggingFace Hub |
revision | Optional[str] | None | Commit hash or revision for the repo in the HuggingFace Hub |
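Similarly, an adapter initialization can point at a Hugging Face Hub repository through an HFHubPath; the repo ID and revision below are placeholders:

```yaml
initial_adapter_model_ckpt_path:
  repo_id: my-org/my-lora-adapter   # repo on the Hugging Face Hub
  revision: 1a2b3c4                 # optional commit hash or revision
```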
LocalPath
Name | Type | Default | Description |
---|---|---|---|
path | str | N/A | Local path to the file(s) |
ReadConfig
Name | Type | Default | Description |
---|---|---|---|
data_format | str | json | A string that specifies the expected data format to read. Possible choices: 'json', 'csv', 'parquet', etc. Because reading uses Ray Data's read mechanism, Ray Data must provide a corresponding read_{data_format} API. Please refer to https://docs.ray.io/en/latest/data/api/input_output.html for more information. |
params | dict[str, Any] | {} | The read kwargs that are accepted by the corresponding Ray Data's read_{data_format} API. Please refer to https://docs.ray.io/en/latest/data/api/input_output.html for more information. |
path | str | N/A | The data file or directory path to read from. |
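train_path and valid_path can also be given as a ReadConfig when the dataset needs a specific Ray Data reader or read kwargs. A sketch assuming a Parquet dataset; the path and read kwargs are placeholders:

```yaml
train_path:
  path: s3://my-bucket/data/train/   # file or directory to read from
  data_format: parquet               # resolved to Ray Data's read_parquet
  params:
    columns: ["messages"]            # read kwargs forwarded to read_parquet
```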
OptimizerConfig
Name | Type | Default | Description |
---|---|---|---|
optimizer_cls | str | adamw_torch | The optimizer class to use. Must be one of the following optimizers: ['adamw_torch_fused', 'adamw_torch', 'adamw_bnb_8bit', 'sgd', 'adagrad', 'lion_8bit', 'lion_32bit', 'adafactor', 'paged_adamw_8bit', 'paged_adamw_32bit', 'paged_lion_32bit', 'paged_lion_8bit', 'paged_ademamix_8bit', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit', 'ademamix', 'ademamix_8bit', 'grokadamw', 'schedule_free_adamw', 'schedule_free_sgd'] |
optimizer_kwargs | dict[str, Any] | {'weight_decay': 0.0, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-08} | The optimizer-specific keyword arguments to pass to the optimizer class. The learning rate must be specified separately as a top-level argument. Compatible with the Hugging Face TrainingArguments optimizer configuration. Arguments for AdamW can be set directly (i.e. adam_beta1, adam_beta2, adam_epsilon), and additional optimizer-specific arguments can be set via the optim_args flag (for example: {'weight_decay': 0.0, ..., 'optim_args': {'amsgrad': True}}). |
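A sketch of overriding the optimizer, using only the fields above; the values are illustrative and the optim_args entry follows the example given in the description:

```yaml
optimizer_config:
  optimizer_cls: adamw_torch_fused
  optimizer_kwargs:
    weight_decay: 0.01
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_epsilon: 1.0e-8
    optim_args:
      amsgrad: true   # optimizer-specific extras go under optim_args
```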
LoggerConfig
Name | Type | Default | Description |
---|---|---|---|
provider | Literal[wandb, mlflow] | N/A | The logging provider to use |
provider_config | Union[WandbLoggerConfig, MLflowLoggerConfig] | N/A | The logger provider configuration |
rank_zero_only | bool | True | If True, the logger will only be used by the rank 0 process |
WandbLoggerConfig
Name | Type | Default | Description |
---|---|---|---|
group | Optional[str] | None | The group name for the Weights and Biases run |
id | Optional[str] | None | The Weights and Biases trial id to resume from |
name | Optional[str] | None | The trial name for the Weights and Biases run. By default, it is set to the generated model tag. |
project | str | llmforge | The project name for the Weights and Biases run. Default is 'llmforge'. |
tags | Optional[list[str]] | None | The tags to be associated with the Weights and Biases run |
MLflowLoggerConfig
Name | Type | Default | Description |
---|---|---|---|
create_experiment_if_not_exists | bool | True | If True, the experiment will be created if it does not already exist |
experiment_id | Optional[str] | None | The id of an already existing MLflow experiment to use for logging. Takes precedence over experiment_name . |
experiment_name | Optional[str] | None | The name of the MLflow experiment to use for logging. If the experiment does not exist and create_experiment_if_not_exists is True, a new experiment will be created. |
run_name | Optional[str] | None | The name of the MLflow run. By default, it is set to the generated model tag. |
tags | Optional[dict] | None | The tags to be associated with the MLflow run |
tracking_uri | str | N/A | The tracking URI for the MLflow server |
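A hedged sketch of enabling one of the two supported providers; the project, run name, tracking URI, and experiment name are placeholders:

```yaml
logger:
  provider: wandb
  rank_zero_only: true
  provider_config:
    project: llmforge
    name: llama3-8b-sft-run-1        # defaults to the generated model tag if omitted
    tags: ["sft", "demo"]

# Or, with MLflow:
# logger:
#   provider: mlflow
#   provider_config:
#     tracking_uri: http://mlflow.internal:5000
#     experiment_name: llmforge-finetuning
#     create_experiment_if_not_exists: true
```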
LigerConfig
Name | Type | Default | Description |
---|---|---|---|
enabled | bool | False | If true, use Liger Kernels for training. See the liger_kernel repo for a list of supported models. |
kwargs | dict[str, Any] | {} | Keyword arguments to pass to the corresponding apply_liger_kernel_to_* function. See https://github.com/linkedin/Liger-Kernel/tree/main?tab=readme-ov-file#patching for model specific arguments. |
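A sketch of turning on Liger Kernels; the kwargs shown are the kind of model-specific flags accepted by the apply_liger_kernel_to_* functions and may not apply to every model:

```yaml
liger_kernel:
  enabled: true
  kwargs:
    rope: true             # forwarded to the matching apply_liger_kernel_to_* call
    cross_entropy: true
```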
TorchCompileConfig
Name | Type | Default | Description |
---|---|---|---|
backend | str | inductor | The backend to use for torch.compile. Default is inductor. |
enabled | bool | False | If true, LLMForge compiles the model using torch.compile. |
kwargs | dict[str, Any] | {} | Additional kwargs to pass to torch.compile, for example mode, dynamic, or options. See the full options at https://pytorch.org/docs/main/generated/torch.compile.html |
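A sketch of enabling torch.compile through this config; mode and dynamic are standard torch.compile kwargs, and the values are illustrative:

```yaml
torch_compile:
  enabled: true
  backend: inductor
  kwargs:
    mode: max-autotune   # forwarded to torch.compile
    dynamic: false
```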
DeepspeedConfig
Name | Type | Default | Description |
---|---|---|---|
config_path | str | N/A | Path to the DeepSpeed configuration file |
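Enabling DeepSpeed then only requires pointing at a DeepSpeed JSON config; the path below is a placeholder:

```yaml
deepspeed:
  config_path: configs/deepspeed/zero_3.json   # placeholder path to a DeepSpeed JSON config
```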
PaddingStrategy
Enum Name | Value |
---|---|
LONGEST | longest |
MAX_LENGTH | max_length |
DO_NOT_PAD | do_not_pad |
Task
Enum Name | Value |
---|---|
CAUSAL_LM | causal_lm |
INSTRUCTION_TUNING | instruction_tuning |
PREFERENCE_TUNING | preference_tuning |
CLASSIFICATION | classification |
VISION_LANGUAGE | vision_language |
NOT_SPECIFIED | not_specified |
ClassificationConfig
Name | Type | Default | Description |
---|---|---|---|
eval_metrics | list[Metric] | [] | List of evaluation metrics to be used for the classifier. |
label_tokens | list[str] | N/A | List of tokens representing the labels for classification. |
Metric
Enum Name | Value |
---|---|
ACCURACY | accuracy |
PRAUC | prauc |
F1 | f1 |
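A sketch of configuring an LLM classifier; the label tokens are placeholders, and providing classifier_config lets the task be inferred as 'classification':

```yaml
classifier_config:
  label_tokens: ["[[1]]", "[[2]]", "[[3]]"]   # placeholder tokens, one per class
  eval_metrics: ["accuracy", "f1"]            # values from the Metric enum above
```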
PreferenceTuningConfig
Name | Type | Default | Description |
---|---|---|---|
beta | float | 0.01 | Beta hyperparameter for DPO |
logprob_processor_scaling_config | DatasetMapperConfig | See model defaults | Config for reference log probability calculation needed for the preference tuning loss. Internally, this is a Ray dataset map_batches operation where we launch Ray Actors concurrently to compute logits (and then the log probability) from the reference model. Currently, each worker is run on 1 GPU, with the accelerator type configurable. |
DatasetMapperConfig
Name | Type | Default | Description |
---|---|---|---|
batch_size | Optional[int] | None | Batch size per worker for the map operation. If None, an entire block of data is used as a batch. |
concurrency | int | 1 | Number of Ray workers to use concurrently for the map operation. |
custom_resources | dict[str, Any] | {} | Custom resources (per worker) to use. For running on GPUs, please specify accelerator type, with more details in https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#accelerator-types |
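The sketch below combines PreferenceTuningConfig and DatasetMapperConfig to scale the reference log-probability computation for DPO; the concurrency, batch size, and accelerator type are illustrative:

```yaml
preference_tuning_config:
  beta: 0.01
  logprob_processor_scaling_config:
    concurrency: 4                       # 4 reference-model workers, 1 GPU each
    batch_size: 2
    custom_resources:
      "accelerator_type:A10G": 0.001     # pin the mapper workers to A10G GPUs
```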
VisionLanguageConfig
Name | Type | Default | Description |
---|---|---|---|
image_resolution | Optional[Union[tuple[int, int], int]] | (224, 224) | Resolution to resize images to. Can be specified as a (H, W) tuple, an int, or None. If an int is given, it will be used for both image height and width. Images will be resized with padding so that the larger dimension matches the specified resolution, and the smaller dimension is padded with zeros. If None, then images will not be resized, and image preprocessing behavior will fall back to the default for the underlying model processor. |
vision_encoder_scaling_config | DatasetMapperConfig | See model defaults | Config for scaling the vision encoder. Internally, this is a Ray Data map_batches operation where Ray Actors are launched concurrently to compute image features from the vision tower of the specified model. Currently, each worker runs on 1 GPU, with the accelerator type configurable. |
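A sketch for vision-language instruction tuning; the resolution and scaling values are illustrative:

```yaml
vision_language_config:
  image_resolution: [336, 336]           # (H, W); a single int applies to both dimensions
  vision_encoder_scaling_config:
    concurrency: 2                       # vision-tower workers, 1 GPU each
    custom_resources:
      "accelerator_type:A10G": 0.001
```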
GenerationConfig
Name | Type | Default | Description |
---|---|---|---|
prompt_format | PromptFormat | N/A | Handles chat template formatting and tokenization |
stopping_sequences | Optional[list[str]] | None | Stopping sequences to propagate for inference. By default, we use EOS/UNK tokens at inference. |
PromptFormat
Name | Type | Default | Description |
---|---|---|---|
add_system_tags_even_if_message_is_empty | bool | False | If True, the system message will be included in the prompt even if the content of the system message is empty. |
assistant | str | N/A | The template for the assistant message. This is used when the input list of messages includes assistant messages. The content of those messages is reformatted with this template. It should include the {instruction} template. |
bos | str | "" | The string that should be prepended to the text before sending it to the model for completion. Defaults to the empty string. |
default_system_message | str | "" | The default system message that should be included in the prompt if no system message is provided in the input list of messages. Defaults to the empty string. |
strip_whitespace | bool | True | If True, the whitespace in the content of the messages will be stripped. |
system | str | N/A | The template for the system message. It should include the {instruction} template. |
system_in_last_user | bool | False | (Inference only) If True, the system message will be included in the last user message. Otherwise, it will be included in the first user message. This is not used during fine-tuning. |
system_in_user | bool | False | If True, the system message will be included in the user message. |
trailing_assistant | str | | (Inference only) The string that should be appended to the end of the text before sending it to the model for completion at inference time. This is not used during fine-tuning. |
user | str | N/A | The template for the user message. It should include the {instruction} template. If system_in_user is set to True, it should also include the {system} template. |
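A sketch of a custom generation config for a model outside the supported list; the template strings and stop sequence are placeholders and only illustrate where the {instruction} slot goes:

```yaml
generation_config:
  prompt_format:
    system: "<|system|>\n{instruction}\n"       # must contain {instruction}
    user: "<|user|>\n{instruction}\n"           # would also need {system} if system_in_user were true
    assistant: "<|assistant|>\n{instruction}\n"
    trailing_assistant: "<|assistant|>\n"       # inference only
    bos: "<s>"
    default_system_message: ""
    system_in_user: false
    strip_whitespace: true
  stopping_sequences: ["</s>"]
```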
CheckpointAndEvalFrequency
Name | Type | Default | Description |
---|---|---|---|
frequency | int | 1 | Frequency for the given checkpointing/evaluation strategy. For example, if the unit is epochs, then saving and evaluation are performed every frequency epochs. |
unit | FrequencyEnum | FrequencyEnum.EPOCHS | The frequency unit. steps indicates that saving and evaluation happen at a frequency measured in training steps; epochs measures it in epochs. |
FrequencyEnum
Enum Name | Value |
---|---|
STEPS | steps |
EPOCHS | epochs |
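For example, checkpointing and evaluating every epoch (the default unit) looks like this:

```yaml
checkpoint_and_evaluation_frequency:
  unit: epochs
  frequency: 1   # save a checkpoint and (if valid_path is set) evaluate every epoch
```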