Vision Language Models with RayLLM
You can serve vision language models like LLaVA in the same way that you serve any other language model; only a few config parameters differ. The resulting service is OpenAI compatible.
Example query for a vision language model
This is how you can query a vision language model (the example runs llava-hf/llava-v1.6-mistral-7b-hf).
Only single-image, single-turn conversations are supported for now. We will improve our support for various prompt formats as models in this field mature.
- cURL
- Python
- OpenAI Python SDK
curl "$ANYSCALE_BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ANYSCALE_API_KEY" \
-d '{
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?"
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
]
}
],
"temperature": 0.7
}'
import os
import requests

s = requests.Session()

api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"

body = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is shown in this picture?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
                    }
                }
            ]
        }
    ],
    "temperature": 0.7
}

with s.post(url, headers={"Authorization": f"Bearer {token}"}, json=body) as resp:
    print(resp.json())
import openai

client = openai.OpenAI(
    base_url="<PUT_YOUR_BASE_URL>",
    api_key="FAKE_API_KEY",
)

# Note: not all arguments are currently supported and will be ignored by the backend.
chat_completion = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is shown in this picture?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
                    }
                }
            ]
        }
    ],
    temperature=0.7,
)
print(chat_completion.model_dump())
You can send images using either image_url or image_base64. The model's response is returned as text.
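For example, to send a local image you can base64-encode it and pass it as an image_base64 content part. The sketch below is an illustration only: the exact field layout (an image_base64 object with a data key) is an assumption and may differ from your deployment's schema; the rest of the request mirrors the image_url examples above.

import base64
import os
import requests

api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"

# Read a local image and base64-encode it.
with open("llava_v1_5_radar.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                # Assumed field layout for the image_base64 content type;
                # adjust if your deployment expects a different shape.
                {"type": "image_base64", "image_base64": {"data": image_b64}},
            ],
        }
    ],
    "temperature": 0.7,
}

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
print(resp.json())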
Configs to enable vision language models in RayLLM
Only Hugging Face-compatible LLaVA models are supported. We will improve our support for various prompt formats as models in this field mature.
To enable vision language models in RayLLM, you need to make the following changes in the model config YAML file:
- input_modality: Set this to image. This ensures that RayLLM converts image URLs into in-memory images.
- vision under prompt_format: Set this to true. This ensures that prompts with image URLs are correctly validated and parsed.
...
input_modality: image
generation_config:
  prompt_format:
    system: "{instruction}"  # not used for now
    assistant: "{instruction}"  # not used for now
    trailing_assistant: ""
    user: "[INST] <image>\n{instruction} [/INST]"  # Pixtral would expect [IMG] instead of <image>
    vision: true
...
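To see what the user template above produces, here is a minimal sketch that renders a single-turn message with plain string substitution. This is only an illustration of the template from the config; RayLLM performs the actual prompt formatting and image handling internally.

# Illustration only: render the LLaVA user template from the config above.
user_template = "[INST] <image>\n{instruction} [/INST]"
instruction = "What is shown in this picture?"
prompt = user_template.format(instruction=instruction)
print(prompt)
# [INST] <image>
# What is shown in this picture? [/INST]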