Vision-language tuning

Vision-language instruction tuning trains models to reason over images interleaved with text. You must specify all images within the "user" role messages; the model calculates loss only over the "assistant" role tokens. Specify images in the data using the format shown below. The "url" field accepts valid HTTP URLs, base64-encoded images, paths to local files, "s3://" paths, or "gs://" paths.
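
For example, a small helper like the one below can build an image entry from either a remote URI or a local file. This is only a sketch: the function name and the data-URL prefix used for base64 images are assumptions, not part of the LLMForge API.

import base64

def image_entry(path_or_url: str) -> dict:
    """Build one image entry for a "user" message. Illustrative only."""
    if path_or_url.startswith(("http://", "https://", "s3://", "gs://")):
        # Remote locations can be passed through unchanged.
        return {"type": "image_url", "image_url": {"url": path_or_url}}
    # Local file: inline as base64. Whether LLMForge expects a raw base64
    # string or a data-URL prefix is an assumption here; check the data docs.
    with open(path_or_url, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}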

LLMForge supports fine-tuning of LLaVA-style models, with both full fine-tuning and LoRA fine-tuning of the multi-modal projection and language model portions of the model. LLMForge freezes the vision encoder for all models during fine-tuning. The following image, from this Hugging Face blog, illustrates the architecture:

[Image: LLaVA-style architecture with a vision encoder, a multi-modal projection, and a language model, from the Hugging Face blog linked above.]
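
Conceptually, this parameter freezing looks like the following PyTorch sketch. The attribute names (vision_tower, multi_modal_projector, language_model) mirror Hugging Face's LLaVA-style model classes and are illustrative; this is not LLMForge's internal code.

import torch.nn as nn

def set_trainable_parts(model: nn.Module) -> None:
    # Vision encoder stays frozen during fine-tuning.
    for p in model.vision_tower.parameters():
        p.requires_grad = False
    # The multi-modal projection and language model are trainable
    # (fully, or through LoRA adapters).
    for p in model.multi_modal_projector.parameters():
        p.requires_grad = True
    for p in model.language_model.parameters():
        p.requires_grad = True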

Supported models

LLMForge supports fine-tuning for the following vision-language models:

  • mistral-community/pixtral-12b
  • HuggingFaceTB/SmolVLM-Base, HuggingFaceTB/SmolVLM-Instruct, HuggingFaceTB/SmolVLM-Synthetic
Note: LLMForge supports vision-language tuning only with llmforge versions >= 0.5.9. See the full list of available versions and images.

Example config

model_id: mistral-community/pixtral-12b
task: vision_language # Optional: LLMForge can infer this task when you provide `vision_language_config`.
vision_language_config:
  image_resolution: [224, 368] # H, W to reshape images to.
  vision_encoder_scaling_config: # Config for the frozen vision encoder.
    custom_resources:
      accelerator_type:A10G: 1
    concurrency: 1 # Runs vision encoder forward passes on 1 GPU.
    batch_size: 8

# LLMForge spawns additional Ray Data processes for vision-language data processing. Setting these parameters
# explicitly makes sure that there are enough compute resources for all portions of the data pipeline.
data_processor_config:
  concurrency: 8
  batch_size: 8
...
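
Note that image_resolution is ordered as [height, width]. The snippet below only previews what a 224 x 368 target shape means for an image using Pillow (which takes sizes as (width, height)); it is not LLMForge's actual preprocessing code, and the file path is hypothetical.

from PIL import Image

H, W = 224, 368  # matches image_resolution: [224, 368] above
img = Image.open("example.png")  # hypothetical local image
print(img.resize((W, H)).size)  # Pillow uses (width, height), so this prints (368, 224)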

Example dataset

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "s3://anyscale-vision-language-example-data/images/processed_0.png"}},
        {"type": "image_url", "image_url": {"url": "s3://anyscale-vision-language-example-data/images/processed_1.png"}},
        {"type": "text", "content": "What is the difference between the two pizzas in these images?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "content": "The pizza in the first image is on a red plate and being held by an old lady, while the pizza in the second image is on a metal counter being prepared by a woman in a blue shirt."}
      ]
    }
  ]
}
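
Each training record follows the schema above; a dataset is typically one such JSON object per line (JSONL). The sketch below writes a single record to a file, assuming the usual one-record-per-line layout; the image URL and output path are hypothetical.

import json

record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "s3://my-bucket/images/processed_0.png"}},
                {"type": "text", "content": "Describe this image."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "content": "A pizza on a red plate."}],
        },
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line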