Vision Language Models with RayLLM
You can serve vision language models such as LLaVA in the same way you serve any other language model; only a few config parameters differ. The resulting service is OpenAI compatible. Here is what an example query for a vision language model looks like:
Example query for a vision language model
This is how you can query a vision language model (this example runs on llava-hf/llava-v1.6-mistral-7b-hf).
note
Only single-image, single-turn conversations are supported for now. We will improve support for various prompt formats as models mature in this field.
- cURL
- Python
- OpenAI Python SDK
curl "$ANYSCALE_BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ANYSCALE_API_KEY" \
-d '{
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?"
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
]
}
],
"temperature": 0.7
}'
import os
import requests
s = requests.Session()
api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"
body = {
"model": "llava-hf/llava-v1.6-mistral-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?",
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
}
]
}
],
"temperature": 0.7
}
with s.post(url, headers={"Authorization": f"Bearer {token}"}, json=body) as resp:
print(resp.json())
import openai
client = openai.OpenAI(
    base_url="<PUT_YOUR_BASE_URL>",
    api_key="FAKE_API_KEY",
)
# Note: not all arguments are currently supported and will be ignored by the backend.
chat_completion = client.chat.completions.create(
model="llava-hf/llava-v1.6-mistral-7b-hf",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this picture?",
},
{
"type": "image_url",
"image_url": {
"url": "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
}
}
]
}
],
temperature=0.7
)
print(chat_completion.model_dump())
You can send images using either image_url or image_base64. The response is a regular text response.
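For example, a base64 upload based on the Python requests example above might look like the following sketch. It assumes the image_base64 content item mirrors the image_url shape and carries the encoded bytes in a data field; those field names are an assumption, so check the RayLLM schema before relying on them. The path example.jpg is a hypothetical local image.
import base64
import os

import requests

api_base = os.getenv("ANYSCALE_BASE_URL")
token = os.getenv("ANYSCALE_API_KEY")
url = f"{api_base}/chat/completions"

# Read a local image (hypothetical path) and encode it as a base64 string.
with open("example.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

body = {
    "model": "llava-hf/llava-v1.6-mistral-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                # Assumed field layout for base64 images; verify against the RayLLM schema.
                {"type": "image_base64", "image_base64": {"data": encoded_image}},
            ],
        }
    ],
    "temperature": 0.7,
}

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
print(resp.json())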
Configs to enable vision language models in RayLLM
To enable vision language models in RayLLM, you need to make the following changes in the model config YAML file:
- input_modality: Set this to image. This ensures that RayLLM converts image URLs into in-memory images.
- vision under prompt_format: Set this to true. This ensures that prompts with image URLs are correctly validated and parsed.
...
input_modality: image
generation_config:
  prompt_format:
    system: "{instruction}" # not used for now
    assistant: "{instruction}" # not used for now
    trailing_assistant: ""
    user: "[INST] <image>\n{instruction} [/INST]"
    vision: true
...
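To illustrate how the user template above is applied, here is a sketch of how the text portion of the example query renders. This is an illustration only, not RayLLM internals: the <image> placeholder marks where the backend injects the in-memory image, and the exact substitution is handled internally.
# Illustration only: rendering the user template from the config above.
# The <image> placeholder is where RayLLM inserts the image; how that
# substitution happens is internal to the backend.
user_template = "[INST] <image>\n{instruction} [/INST]"
print(user_template.format(instruction="What is shown in this picture?"))
# Output:
# [INST] <image>
# What is shown in this picture? [/INST]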