checkpoint_every_n_epochs | Optional[int] | 1 | If provided, the model will run validation and save a checkpoint after every n epochs. If None, we automatically set it such that max_num_checkpoints checkpointing events are triggered during training. |
classifier_config | Optional[ClassifierConfig] | None | Config for the LLM classifier. |
context_length | Optional[int] | None | The context length to use for training. If not provided, it will be automatically calculated based on the dataset statistics as follows: At the beginning of training we analyze the training and validation dataset. The context length is chosen to be the maximum of the 95th percentile of the token count and the context length of the base model. For context length extension this can be larger than the base model context length. |
deepspeed | Optional[DeepspeedConfig] | None | The DeepSpeed configuration to use for training. This field is required if you want to use DeepSpeed for training; see the full-parameter sketch after this table. |
embedding_scaling_technique | str | none | The technique to use for context length extension. It can be 'none', or 'linear' for positional interpolation; see the context-length sketch after this table. |
eval_batch_size_per_device | int | 1 | The batch size per device to use for evaluation. |
flash_attention_2 | bool | True | Set to True to use FlashAttention-2 kernels. |
gradient_accumulation_steps | int | 1 | The number of gradient accumulation steps. Use this to increase the effective batch size per device and improve convergence. |
initial_adapter_model_ckpt_path | Optional[str] | None | If provided, and LoRA is enabled, load the initialization of the adapter weights from this path. Currently supports S3. |
initial_base_model_ckpt_path | Optional[str] | None | If provided, load the base model weights from this path. Currently supports S3. |
learning_rate | float | 5e-06 | The learning rate to use for training. |
lora_config | dict | | We support any Hugging Face-compatible LoRA configuration, such as different ranks, target modules, etc.; see the LoRA sketch after this table. If not provided, we will do full-parameter fine-tuning. |
lr_scheduler_type | str | cosine | The learning rate scheduler type to use for training. It can be 'cosine' or 'linear'. |
max_num_checkpoints | int | 10 | The maximum number of validation + checkpointing events to trigger. If checkpoint_every_n_epochs is not provided, this also sets the frequency at which validation and checkpointing events run. |
min_num_update_steps | int | 100 | The minimum number of update steps to ensure model convergence. This is used to calculate the number of epochs when num_epochs is not provided. |
model_id | str | N/A | The base model ID on the Hugging Face model hub. |
no_gradient_checkpoint | bool | False | Set to True to disable gradient checkpointing. By default gradient checkpointing is enabled; it saves memory and enables training larger models at longer context lengths by trading off speed. |
num_checkpoints_to_keep | Optional[int] | 1 | The number of checkpoints to keep. You can keep more than one so that you have multiple checkpoints to validate after training. |
num_data_blocks_per_device | int | 2 | Number of dataset blocks per GPU; controls the intensity of data ingestion. Increasing it can improve loading speed, but too many blocks can trigger autoscaling and add overhead. Tune it for faster performance without going overboard; if unsure, keep the default. |
num_devices | int | N/A | The number of GPUs to use for ZeRO data-parallel training. |
num_epochs | Optional[int] | None | The number of epochs to train the model. If not provided, it will be automatically calculated based on the dataset size and the minimum number of updates. We want to make sure we cover at least min_num_update_steps updates for convergence. |
num_warmup_steps | int | 10 | The number of warmup steps for the cosine learning rate scheduler. |
pad_to_multiple_of | Optional[int] | 8 | Input sequences will be padded to a multiple of the specified value during training, to leverage Tensor Cores on NVIDIA Volta (or newer) hardware. |
padding | PaddingStrategy | PaddingStrategy.LONGEST | The padding strategy to use for training. When benchmarking, the 'max_length' strategy is recommended so that the model trains on the longest sequence length and you can catch OOMs early. For production training, use the 'longest' strategy to avoid wasting compute on padding tokens. |
preprocess_batch_size | Optional[int] | None | Batch size for the dataset preprocessing step. The default of None uses the entire block as the batch. |
train_batch_size_per_device | int | 1 | The batch size per device to use for training, without considering gradient accumulation. |
train_path | str | N/A | The location of the training dataset. It can be a local path or a remote path on S3 or GCS. |
trainer_resources | dict | | The rank-zero worker resources to use during training: a dictionary that maps resource types to requirements for the rank-zero worker. For example, {'memory': 10_000_000_000} means rank zero should be scheduled on a machine with at least 10 GB of RAM, e.g. for weight aggregation during checkpointing. The rank-zero worker can have different resource requirements than rank-non-zero workers; we usually set a higher memory requirement for rank 0 in full-parameter fine-tuning to provide more CPU RAM for weight aggregation. See the example configs and the full-parameter sketch after this table. |
use_cli_to_checkpoint | bool | True | |
valid_path | Optional[str] | None | The location of the validation dataset. It can be a local path or a remote path on S3 or GCS. If provided, the model will be evaluated on this dataset after each epoch and the best checkpoint will be saved according to the lowest perplexity achieved on this dataset. If not provided, checkpoints will be saved according to the lowest perplexity achieved on the training dataset. |
worker_resources | dict | | The rank-non-zero worker resources to use during training: a dictionary that maps resource types to requirements for the rank-non-zero workers. For example, {'accelerator_type:A10G': 0.001} means an A10G GPU should be used for each worker. The rank-zero worker can have different resource requirements, such as different RAM requirements, compared to rank-non-zero workers. Some common GPU types include A100-40G, A100-80G, H100, and L4; the availability of these GPUs depends on demand or reservations within your cloud. See the example configs and the full-parameter sketch after this table for concrete examples. |
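
To tie the fields above together, here is a minimal sketch of a LoRA fine-tuning config. It assumes the YAML layout of the example configs referenced in this table; the model ID, bucket paths, and LoRA values are illustrative placeholders rather than recommendations.

```yaml
model_id: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder base model from the Hugging Face model hub
train_path: s3://my-bucket/train.jsonl         # hypothetical S3 path
valid_path: s3://my-bucket/valid.jsonl         # optional; enables validation-based checkpoint selection
num_devices: 4
learning_rate: 5e-6
num_epochs: 3
train_batch_size_per_device: 1
eval_batch_size_per_device: 1
gradient_accumulation_steps: 8                 # effective batch size = 1 x 4 x 8 = 32 sequences per update
padding: longest
lora_config:                                   # any Hugging Face-compatible LoRA configuration
  r: 8
  lora_alpha: 16
  target_modules: ["q_proj", "v_proj"]
```

The effective batch size is train_batch_size_per_device x num_devices x gradient_accumulation_steps, so gradient accumulation lets you emulate a larger batch without extra GPU memory.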
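
For full-parameter fine-tuning (no lora_config), the deepspeed, trainer_resources, and worker_resources fields typically come into play. A sketch, assuming the DeepSpeed config is referenced via a config_path entry as in the example configs; the path and resource values are illustrative:

```yaml
deepspeed:
  config_path: deepspeed_configs/zero_3.json  # hypothetical path to a ZeRO-3 JSON config
trainer_resources:
  memory: 100_000_000_000                     # give rank zero ~100 GB of CPU RAM for weight aggregation
worker_resources:
  "accelerator_type:A10G": 0.001              # schedule each worker on an A10G GPU
```

Rank zero gets the larger memory request because it aggregates the full model weights during checkpointing, while the other ranks only need their GPU.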
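
Context length extension and checkpoint cadence are likewise controlled by a handful of fields. A sketch with illustrative values:

```yaml
context_length: 8192                 # longer than the base model's native context window
embedding_scaling_technique: linear  # positional interpolation for context length extension
checkpoint_every_n_epochs: 2         # run validation and save a checkpoint every 2 epochs
num_checkpoints_to_keep: 3           # keep several checkpoints to compare after training
```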