Vision-language tuning

Vision-language instruction tuning trains models to reason over images interleaved with text. You must specify all images within the "user" role messages; the model calculates loss only over the "assistant" role tokens. Specify images in the data using the format shown below. The "url" field accepts valid HTTP URLs, base64-encoded images, paths to local files, "s3://" paths, or "gs://" paths.
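
For example, a small helper like the one below can build an image entry from either a remote URI or a local file. This is only a sketch: the function name and the data-URL prefix used for base64 images are assumptions, not part of the LLMForge API.

import base64

def image_entry(path_or_url: str) -> dict:
    """Build one image entry for a "user" message. Illustrative only."""
    if path_or_url.startswith(("http://", "https://", "s3://", "gs://")):
        # Remote locations can be passed through unchanged.
        return {"type": "image_url", "image_url": {"url": path_or_url}}
    # Local file: inline as base64. Whether LLMForge expects a raw base64
    # string or a data-URL prefix is an assumption here; check the data docs.
    with open(path_or_url, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}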

LLMForge supports fine-tuning of LLaVA-style models, with both full fine-tuning and LoRA fine-tuning of the multi-modal projection and language model portions of the model. LLMForge freezes the vision encoder for all models during fine-tuning. The following image, from this Hugging Face blog, illustrates the architecture:

[Image: LLaVA-style architecture with a vision encoder, a multi-modal projection, and a language model, from the Hugging Face blog linked above.]
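
Conceptually, this parameter freezing looks like the following PyTorch sketch. The attribute names (vision_tower, multi_modal_projector, language_model) mirror Hugging Face's LLaVA-style model classes and are illustrative; this is not LLMForge's internal code.

import torch.nn as nn

def set_trainable_parts(model: nn.Module) -> None:
    # Vision encoder stays frozen during fine-tuning.
    for p in model.vision_tower.parameters():
        p.requires_grad = False
    # The multi-modal projection and language model are trainable
    # (fully, or through LoRA adapters).
    for p in model.multi_modal_projector.parameters():
        p.requires_grad = True
    for p in model.language_model.parameters():
        p.requires_grad = True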

Supported models

LLMForge supports fine-tuning for the following vision-language models:

  • mistral-community/pixtral-12b
  • HuggingFaceTB/SmolVLM-Base, HuggingFaceTB/SmolVLM-Instruct, HuggingFaceTB/SmolVLM-Synthetic
Note: LLMForge supports vision-language tuning only with llmforge versions >= 0.5.9. See the full list of available versions and images.

Example config

model_id: mistral-community/pixtral-12b
task: vision_language # Optional: LLMForge can infer this task when you provide `vision_language_config`.
vision_language_config:
  image_resolution: [224, 368] # H, W to reshape images to.
  vision_encoder_scaling_config: # Config for the frozen vision encoder.
    custom_resources:
      accelerator_type:A10G: 1
    concurrency: 1 # Runs vision encoder forward passes on 1 GPU.
    batch_size: 8

# LLMForge spawns additional Ray Data processes for vision-language data processing. Setting these parameters
# explicitly makes sure that there are enough compute resources for all portions of the data pipeline.
data_processor_config:
  concurrency: 8
  batch_size: 8
...
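
Note that image_resolution is ordered as [height, width]. The snippet below only previews what a 224 x 368 target shape means for an image using Pillow (which takes sizes as (width, height)); it is not LLMForge's actual preprocessing code, and the file path is hypothetical.

from PIL import Image

H, W = 224, 368  # matches image_resolution: [224, 368] above
img = Image.open("example.png")  # hypothetical local image
print(img.resize((W, H)).size)  # Pillow uses (width, height), so this prints (368, 224)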

Example dataset

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "s3://anyscale-vision-language-example-data/images/processed_0.png"}},
        {"type": "image_url", "image_url": {"url": "s3://anyscale-vision-language-example-data/images/processed_1.png"}},
        {"type": "text", "content": "What is the difference between the two pizzas in these images?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "content": "The pizza in the first image is on a red plate and being held by an old lady, while the pizza in the second image is on a metal counter being prepared by a woman in a blue shirt."}
      ]
    }
  ]
}
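
Each training record follows the schema above; a dataset is typically one such JSON object per line (JSONL). The sketch below writes a single record to a file, assuming the usual one-record-per-line layout; the image URL and output path are hypothetical.

import json

record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "s3://my-bucket/images/processed_0.png"}},
                {"type": "text", "content": "Describe this image."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "content": "A pizza on a red plate."}],
        },
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line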