checkpoint_every_n_epochs | Optional[int] | 1 | If provided, the model will run validation and save a checkpoint after every n epochs. If None, we automatically set it such that max_num_checkpoints checkpointing events are triggered during training. |
classifier_config | Optional[ClassifierConfig] | None | Config for the LLM classifier. |
context_length | Optional[int] | None | The context length to use for training. If not provided, it will be automatically calculated based on the dataset statistics as follows: At the beginning of training we analyze the training and validation dataset. The context length is chosen to be the maximum of the 95th percentile of the token count and the context length of the base model. For context length extension this can be larger than the base model context length. |
deepspeed | Optional[DeepspeedConfig] | None | The DeepSpeed configuration to use for training. This field is required if you want to use DeepSpeed for training; see the full-parameter sketch after this table. |
embedding_scaling_technique | str | none | The technique to use for context length extension. It can be 'none', or 'linear' for positional interpolation; see the context-length sketch after this table. |
eval_batch_size_per_device | int | 1 | The batch size per device to use for evaluation. |
flash_attention_2 | bool | True | Set to True to use FlashAttention-2 kernels. |
gradient_accumulation_steps | int | 1 | The number of gradient accumulation steps. Use this to increase the effective batch size per device and improve convergence. |
initial_adapter_model_ckpt_path | Optional[str] | None | If provided, and LoRA is enabled, load the initialization of the adapter weights from this path. Currently supports S3. |
initial_base_model_ckpt_path | Optional[str] | None | If provided, load the base model weights from this path. Currently supports S3. |
learning_rate | float | 5e-06 | The learning rate to use for training. |
lora_config | dict | | We support any Hugging Face-compatible LoRA configuration, such as different ranks, target modules, etc.; see the LoRA sketch after this table. If not provided, we will do full-parameter fine-tuning. |
lr_scheduler_type | str | cosine | The learning rate scheduler type to use for training. It can be 'cosine' or 'linear'. |
max_num_checkpoints | int | 10 | The maximum number of validation + checkpointing events to trigger. If checkpoint_every_n_epochs is not provided, this also sets the frequency at which validation and checkpointing events run. |
min_num_update_steps | int | 100 | The minimum number of update steps to ensure model convergence. This is used to calculate the number of epochs when num_epochs is not provided. |
model_id | str | N/A | The base model ID on the Hugging Face model hub. |
no_gradient_checkpoint | bool | False | Set to True to disable gradient checkpointing. By default gradient checkpointing is enabled; it saves memory and enables training larger models at longer context lengths by trading off speed. |
num_checkpoints_to_keep | Optional[int] | 1 | The number of checkpoints to keep. You can keep more than one so that you have multiple checkpoints to validate after training. |
num_data_blocks_per_device | int | 2 | Number of dataset blocks per GPU; controls the intensity of data ingestion. Increasing it can improve loading speed, but too many blocks can trigger autoscaling and add overhead. Tune it for faster performance without going overboard; if unsure, keep the default. |
num_devices | int | N/A | The number of GPUs to use for ZeRO data-parallel training. |
num_epochs | Optional[int] | None | The number of epochs to train the model. If not provided, it will be automatically calculated based on the dataset size and the minimum number of updates. We want to make sure we cover at least min_num_update_steps updates for convergence. |
num_warmup_steps | int | 10 | The number of warmup steps for the cosine learning rate scheduler. |
pad_to_multiple_of | Optional[int] | 8 | Input sequences will be padded to a multiple of the specified value during training, to leverage Tensor Cores on NVIDIA Volta (or newer) hardware. |
padding | PaddingStrategy | PaddingStrategy.LONGEST | The padding strategy to use for training. When benchmarking, the 'max_length' strategy is recommended so that the model trains on the longest sequence length and you can catch OOMs early. For production training, use the 'longest' strategy to avoid wasting compute on padding tokens. |
preprocess_batch_size | Optional[int] | None | Batch size for the dataset preprocessing step. The default of None uses the entire block as the batch. |
train_batch_size_per_device | int | 1 | The batch size per device to use for training, without considering gradient accumulation. |
train_path | str | N/A | The location of the training dataset. It can be a local path or a remote path on S3 or GCS. |
trainer_resources | dict | | The rank-zero worker resources to use during training: a dictionary that maps resource types to requirements for the rank-zero worker. For example, {'memory': 10_000_000_000} means rank zero should be scheduled on a machine with at least 10 GB of RAM, e.g. for weight aggregation during checkpointing. The rank-zero worker can have different resource requirements than rank-non-zero workers; we usually set a higher memory requirement for rank 0 in full-parameter fine-tuning to provide more CPU RAM for weight aggregation. See the example configs and the full-parameter sketch after this table. |
use_cli_to_checkpoint | bool | True | |
valid_path | Optional[str] | None | The location of the validation dataset. It can be a local path or a remote path on S3 or GCS. If provided, the model will be evaluated on this dataset after each epoch and the best checkpoint will be saved according to the lowest perplexity achieved on this dataset. If not provided, checkpoints will be saved according to the lowest perplexity achieved on the training dataset. |
worker_resources | dict | | The rank-non-zero worker resources to use during training: a dictionary that maps resource types to requirements for the rank-non-zero workers. For example, {'accelerator_type:A10G': 0.001} means an A10G GPU should be used for each worker. The rank-zero worker can have different resource requirements, such as different RAM requirements, compared to rank-non-zero workers. Some common GPU types include A100-40G, A100-80G, H100, and L4; the availability of these GPUs depends on demand or reservations within your cloud. See the example configs and the full-parameter sketch after this table for concrete examples. |
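
To tie the fields above together, here is a minimal sketch of a LoRA fine-tuning config. It assumes the YAML layout of the example configs referenced in this table; the model ID, bucket paths, and LoRA values are illustrative placeholders rather than recommendations.

```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder base model from the Hugging Face model hub
train_path: s3://my-bucket/train.jsonl         # hypothetical S3 path
valid_path: s3://my-bucket/valid.jsonl         # optional; enables validation-based checkpoint selection
num_devices: 4
learning_rate: 5e-6
num_epochs: 3
train_batch_size_per_device: 1
eval_batch_size_per_device: 1
gradient_accumulation_steps: 8                 # effective batch size = 1 x 4 x 8 = 32 sequences per update
padding: longest
lora_config:                                   # any Hugging Face-compatible LoRA configuration
  r: 8
  lora_alpha: 16
  target_modules: ["q_proj", "v_proj"]
```

The effective batch size is train_batch_size_per_device x num_devices x gradient_accumulation_steps, so gradient accumulation lets you emulate a larger batch without extra GPU memory.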
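
For full-parameter fine-tuning (no lora_config), the deepspeed, trainer_resources, and worker_resources fields typically come into play. A sketch, assuming the DeepSpeed config is referenced via a config_path entry as in the example configs; the path and resource values are illustrative:

```yaml
deepspeed:
  config_path: deepspeed_configs/zero_3.json  # hypothetical path to a ZeRO-3 JSON config
trainer_resources:
  memory: 100_000_000_000                     # give rank zero ~100 GB of CPU RAM for weight aggregation
worker_resources:
  "accelerator_type:A10G": 0.001              # schedule each worker on an A10G GPU
```

Rank zero gets the larger memory request because it aggregates the full model weights during checkpointing, while the other ranks only need their GPU.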
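
Context length extension and checkpoint cadence are likewise controlled by a handful of fields. A sketch with illustrative values:

```yaml
context_length: 8192                 # longer than the base model's native context window
embedding_scaling_technique: linear  # positional interpolation for context length extension
checkpoint_every_n_epochs: 2         # run validation and save a checkpoint every 2 epochs
num_checkpoints_to_keep: 3           # keep several checkpoints to compare after training
```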