Vision Language Models with RayLLM
You can serve vision language models like LLaVA in the same way that you serve any other language model; only a few config parameters differ. The resulting service is OpenAI compatible.
Example query for a vision language model
This is how you can query a vision language model (the example runs llava-hf/llava-v1.6-mistral-7b-hf).
Only single-image, single-turn conversations are supported for now. We will improve our support for various prompt formats as models in this field mature.
- cURL
- Python
- OpenAI Python SDK
curl "$ANYSCALE_BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ANYSCALE_API_KEY" \
-d '{
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?"
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
]
}
],
"temperature": 0.7
}'
import os
import requests

s = requests.Session()

api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"

body = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is shown in this picture?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
                    }
                }
            ]
        }
    ],
    "temperature": 0.7
}

with s.post(url, headers={"Authorization": f"Bearer {token}"}, json=body) as resp:
    print(resp.json())
import openai

client = openai.OpenAI(
    base_url="<PUT_YOUR_BASE_URL>",
    api_key="FAKE_API_KEY",
)

# Note: not all arguments are currently supported and will be ignored by the backend.
chat_completion = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is shown in this picture?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
                    }
                }
            ]
        }
    ],
    temperature=0.7,
)
print(chat_completion.model_dump())
You can send images using either image_url or image_base64. The model's response is returned as text.
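For example, to send a local image you can base64-encode it and pass it as an image_base64 content part. The sketch below is an illustration only: the exact field layout (an image_base64 object with a data key) is an assumption and may differ from your deployment's schema; the rest of the request mirrors the image_url examples above.

import base64
import os
import requests

api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"

# Read a local image and base64-encode it.
with open("llava_v1_5_radar.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                # Assumed field layout for the image_base64 content type;
                # adjust if your deployment expects a different shape.
                {"type": "image_base64", "image_base64": {"data": image_b64}},
            ],
        }
    ],
    "temperature": 0.7,
}

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
print(resp.json())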
Configs to enable vision language models in RayLLM
Only Hugging Face-compatible LLaVA models are supported. We will improve our support for various prompt formats as models in this field mature.
To enable vision language models in RayLLM, you need to make the following changes in the model config YAML file:
- input_modality: Set this to image. This ensures that RayLLM converts image URLs into in-memory images.
- vision under prompt_format: Set this to true. This ensures that prompts with image URLs are correctly validated and parsed.
...
input_modality: image
generation_config:
  prompt_format:
    system: "{instruction}"  # not used for now
    assistant: "{instruction}"  # not used for now
    trailing_assistant: ""
    user: "[INST] <image>\n{instruction} [/INST]"  # Pixtral would expect [IMG] instead of <image>
    vision: true
...
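To see what the user template above produces, here is a minimal sketch that renders a single-turn message with plain string substitution. This is only an illustration of the template from the config; RayLLM performs the actual prompt formatting and image handling internally.

# Illustration only: render the LLaVA user template from the config above.
user_template = "[INST] <image>\n{instruction} [/INST]"
instruction = "What is shown in this picture?"
prompt = user_template.format(instruction=instruction)
print(prompt)
# [INST] <image>
# What is shown in this picture? [/INST]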