
Vision Language Models with RayLLM

You can serve vision language models such as LLaVA in the same way you serve any other language model; only a few config parameters differ. The resulting service is OpenAI-compatible. Here is what an example query for a vision language model looks like:

Example query for a vision language model

This is how you can query a vision language model (the example runs llava-hf/llava-v1.6-mistral-7b-hf).

note

Only single-image, single-turn conversations are supported for now. We will improve support for various prompt formats as models mature in this field.

curl "$ANYSCALE_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ANYSCALE_API_KEY" \
  -d '{
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is shown in this picture?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
            }
          }
        ]
      }
    ],
    "temperature": 0.7
  }'

You can send images using either image_url or image_base64. The response is a text response.
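The same request can be built programmatically. The sketch below constructs an OpenAI-compatible chat payload with one URL-referenced image part; the base64 helper encodes raw bytes as a data URL, which is the common OpenAI convention for inline images (the exact field name RayLLM expects for base64 input may differ, so treat that helper as illustrative). The helper names and the example URL are this sketch's own, not part of RayLLM.

```python
import base64

def image_url_part(url: str) -> dict:
    """Content part that references an image by URL."""
    return {"type": "image_url", "image_url": {"url": url}}

def image_base64_part(raw: bytes, mime: str = "image/jpeg") -> dict:
    """Content part that embeds an image inline as a base64 data URL.
    Illustrative: check the RayLLM schema for the exact base64 field."""
    encoded = base64.b64encode(raw).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}

# A single-turn, single-image request body, mirroring the curl example above.
payload = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                image_url_part("https://example.com/photo.jpg"),  # placeholder URL
            ],
        }
    ],
    "temperature": 0.7,
}
```

You would POST this payload to `$ANYSCALE_BASE_URL/chat/completions` with the same headers as the curl example.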

Configs to enable vision language models in RayLLM

note

Only HF-compatible LLaVA models are supported for now.

To enable vision language models in RayLLM, you need to make the following changes in the model config YAML file:

  • input_modality: Set this to image. This ensures that RayLLM converts image URLs into in-memory images.
  • vision under prompt_format: Set this to true. This ensures that prompts with image URLs are correctly validated and parsed.
...
input_modality: image

generation_config:
  prompt_format:
    system: "{instruction}" # not used for now
    assistant: "{instruction}" # not used for now
    trailing_assistant: ""
    user: "[INST] <image>\n{instruction} [/INST]" # Pixtral would expect [IMG] instead of <image>
    vision: true
...
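To make the `user` template concrete, the snippet below renders it with plain string substitution, producing the prompt shape a LLaVA-style model expects (an `<image>` placeholder followed by the text instruction). This is only an illustration of the template; RayLLM's actual prompt pipeline applies it internally.

```python
# The `user` template from the YAML above (illustrative rendering only).
USER_TEMPLATE = "[INST] <image>\n{instruction} [/INST]"

# Substitute the text portion of the user's message into the template.
prompt = USER_TEMPLATE.format(instruction="What is shown in this picture?")
# The <image> token marks where the in-memory image is injected by the model runtime.
```

A Pixtral-style model would use the same mechanism with `[IMG]` in place of `<image>`, per the comment in the config.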