Task types
LLMForge supports the following tasks out-of-the-box:
- Causal language modeling: Loss considers predictions for all the tokens.
- Instruction tuning: Considers only "assistant" tokens in the loss.
- Classification: Predicts only a user-defined set of labels based on past tokens.
- Preference tuning: Uses the contrast between chosen and rejected messages to improve the model.
- Vision-language instruction tuning: Predicts assistant tokens based on a mix of past image and text tokens.
The following hyperparameters enable tasks:
task
classifier_config
, which is specific to classificationpreference_tuning_config
, which is specific to preference tuningvision_language_config
, which is specific to vision-language instruction tuning
Note that by default, task
defaults to "causal_lm"
unless you specify a task-specific config like classifier_config
, preference_tuning_config
, or vision_language_config
.
Dataset format
You must format the dataset in the OpenAI format for all tasks - whether you're continuing pre-training on plain text, running the causal_lm
task, or classifying messages as safe
or unsafe
. Find details for how to format data for each task type under Data formats and task configs
.