
Vision Language Models with RayLLM

You can serve vision language models such as LLaVA in the same way you serve any other language model; only a few config parameters differ. The resulting service is OpenAI compatible. Here is what an example query for a vision language model looks like:

Example query for a vision language model

This is how you can query a vision language model (the example runs llava-hf/llava-v1.6-mistral-7b-hf).

note

Only single-image, single-turn conversations are supported for now. We will improve support for more prompt formats as models in this field mature.

curl "$ANYSCALE_BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ANYSCALE_API_KEY" \
-d '{
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?"
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
]
}
],
"temperature": 0.7
}'
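
Because the service is OpenAI compatible, you can send the same request through the OpenAI Python SDK. The following is a minimal sketch that mirrors the curl example above; it assumes the openai package is installed and reuses the ANYSCALE_BASE_URL and ANYSCALE_API_KEY environment variables.

import os

from openai import OpenAI

# Point the OpenAI client at the RayLLM endpoint used in the curl example.
client = OpenAI(
    base_url=os.environ["ANYSCALE_BASE_URL"],
    api_key=os.environ["ANYSCALE_API_KEY"],
)

response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
                    },
                },
            ],
        }
    ],
    temperature=0.7,
)

# The model's answer comes back as ordinary chat text.
print(response.choices[0].message.content)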

You can send images using either image_url or image_base64. The response is always text.
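
Below is a minimal sketch of sending a local image as base64 instead of a URL. The exact shape of the image_base64 content part is an assumption (it mirrors the image_url part above), and the file path is a placeholder; check your RayLLM version for the authoritative field names.

import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["ANYSCALE_BASE_URL"],
    api_key=os.environ["ANYSCALE_API_KEY"],
)

# Read a local image and base64-encode it. "example.jpg" is a placeholder path.
with open("example.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                # Assumed shape of a base64 image part, modeled on image_url.
                {"type": "image_base64", "image_base64": {"data": encoded_image}},
            ],
        }
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)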

Configs to enable vision language models in RayLLM

To enable vision language models in RayLLM, you need to make the following changes in the model config YAML file:

  • input_modality: Set this to image. This ensures that RayLLM converts image URLs into in-memory images.
  • vision under prompt_format: Set this to true. This ensures that prompts with image URLs are correctly validated and parsed.
...
input_modality: image

generation_config:
  prompt_format:
    system: "{instruction}" # not used for now
    assistant: "{instruction}" # not used for now
    trailing_assistant: ""
    user: "[INST] <image>\n{instruction} [/INST]"
    vision: true
...