Deploy Llama 3

Install the Anyscale CLI

pip install -U anyscale
anyscale login

Deploy the service

Clone the example from GitHub.

git clone https://github.com/anyscale/examples.git
cd examples/03_deploy_llama_3_8b

This example uses Ray Serve together with vLLM to run Llama 3. The code for the model endpoint is in serve_llama_3_8b.py.
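For orientation, here is a minimal, hypothetical sketch of the Ray Serve plus vLLM pattern. The actual serve_llama_3_8b.py in the example exposes an OpenAI-compatible /v1 route (which the query snippet below relies on) and is structured differently; the class name, app name, and request format here are purely illustrative.

from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaDeployment:
    def __init__(self):
        # Download and load the weights; the gated Llama 3 model needs a Hugging Face token (HF_TOKEN).
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    async def __call__(self, request: Request) -> dict:
        # Illustrative request format: {"prompt": "..."}.
        body = await request.json()
        outputs = self.llm.generate([body["prompt"]], SamplingParams(max_tokens=256))
        return {"text": outputs[0].outputs[0].text}

app = LlamaDeployment.bind()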

Also take a look at service.yaml. This file specifies the container image, compute resources, script entrypoint, and a few other fields.

Deploy the service. The --env flag passes your Hugging Face access token into the service environment so it can download the gated Llama 3 weights, so make sure HF_TOKEN is set in your shell before running the command.

anyscale service deploy -f service.yaml --env HF_TOKEN=$HF_TOKEN

Query the service

The anyscale service deploy command outputs a line that looks like

curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>

From this, you can parse out the service token and base URL. Fill them in at the appropriate locations in query.py to query the model, or use them in the snippet below.
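If you'd rather extract these values programmatically than copy them by hand, a small sketch follows; it assumes the exact curl line format shown above, and the regular expression is only illustrative.

import re

# Example output line from "anyscale service deploy" (placeholders shown literally here).
curl_line = 'curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>'

# Pull the bearer token and the base URL out of that line.
match = re.search(r'Bearer (\S+)" (\S+)$', curl_line)
if match:
    token, base_url = match.group(1), match.group(2)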

from urllib.parse import urljoin
from openai import OpenAI

# The "anyscale service deploy" script outputs a line that looks like
#
# curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>
#
# From this, you can parse out the service token and base URL.
token = <SERVICE_TOKEN> # Fill this in. If deploying and querying locally, use token = "FAKE_KEY"
base_url = <BASE_URL> # Fill this in. If deploying and querying locally, use base_url = "http://localhost:8000"

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=token)

response = client.chat.completions.create(
    model="my-llama-3-8B",
    messages=[
        {"role": "user", "content": "What's the capital of France?"}
    ],
    stream=True,
)

# Stream the response and print the generated text as it arrives.
for chunk in response:
    data = chunk.choices[0].delta.content
    if data:
        print(data, end="", flush=True)
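For a quick check without streaming, the same client can also make a blocking request. This is a small variant of the snippet above, reusing the same client and model name.

# Non-streaming variant: wait for the full completion, then print it.
response = client.chat.completions.create(
    model="my-llama-3-8B",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)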