Deploy Llama 3

Install the Anyscale CLI

pip install -U anyscale
anyscale login

Deploy the service

Clone the example from GitHub.

git clone https://github.com/anyscale/examples.git
cd examples/03_deploy_llama_3_8b

This example uses Ray Serve together with vLLM to run Llama 3. The code for the model endpoint is in serve_llama_3_8b.py.
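For orientation, here is a minimal, hypothetical sketch of the Ray Serve plus vLLM pattern. The actual serve_llama_3_8b.py in the example exposes an OpenAI-compatible /v1 route (which the query snippet below relies on) and is structured differently; the class name, app name, and request format here are purely illustrative.

from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaDeployment:
    def __init__(self):
        # Download and load the weights; the gated Llama 3 model needs a Hugging Face token (HF_TOKEN).
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    async def __call__(self, request: Request) -> dict:
        # Illustrative request format: {"prompt": "..."}.
        body = await request.json()
        outputs = self.llm.generate([body["prompt"]], SamplingParams(max_tokens=256))
        return {"text": outputs[0].outputs[0].text}

app = LlamaDeployment.bind()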

Also take a look at service.yaml. This file specifies the container image, compute resources, script entrypoint, and a few other fields.

Deploy the service. The --env flag passes your Hugging Face access token into the service environment so it can download the gated Llama 3 weights, so make sure HF_TOKEN is set in your shell before running the command.

anyscale service deploy -f service.yaml --env HF_TOKEN=$HF_TOKEN

Query the service

The anyscale service deploy command outputs a line that looks like

curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>

From this, you can parse out the service token and base URL. Fill them in at the appropriate locations in query.py to query the model, or use them in the snippet below.
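If you'd rather extract these values programmatically than copy them by hand, a small sketch follows; it assumes the exact curl line format shown above, and the regular expression is only illustrative.

import re

# Example output line from "anyscale service deploy" (placeholders shown literally here).
curl_line = 'curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>'

# Pull the bearer token and the base URL out of that line.
match = re.search(r'Bearer (\S+)" (\S+)$', curl_line)
if match:
    token, base_url = match.group(1), match.group(2)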

from urllib.parse import urljoin
from openai import OpenAI

# The "anyscale service deploy" script outputs a line that looks like
#
# curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>
#
# From this, you can parse out the service token and base URL.
token = <SERVICE_TOKEN> # Fill this in. If deploying and querying locally, use token = "FAKE_KEY"
base_url = <BASE_URL> # Fill this in. If deploying and querying locally, use base_url = "http://localhost:8000"

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=token)

response = client.chat.completions.create(
    model="my-llama-3-8B",
    messages=[
        {"role": "user", "content": "What's the capital of France?"}
    ],
    stream=True,
)

# Stream the response and print the generated text as it arrives.
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
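For a quick check without streaming, the same client can also make a blocking request. This is a small variant of the snippet above, reusing the same client and model name.

# Non-streaming variant: wait for the full completion, then print it.
response = client.chat.completions.create(
    model="my-llama-3-8B",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)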