Deploy Llama 3.1 70b

This example uses Ray Serve along with vLLM to deploy a Llama 3.1 70b model as an Anyscale service. The same code can be used for similarly sized models.

Install the Anyscale CLI

pip install -U anyscale
anyscale login

Deploy the service

Clone the example from GitHub.

git clone https://github.com/anyscale/examples.git
cd examples/deploy_llama_3_1_70b

Deploy the service. Use --env to forward your Hugging Face token if you need authentication for gated models like Llama 3.1.

anyscale service deploy -f service.yaml --env HF_TOKEN=${HF_TOKEN:?HF_TOKEN is not set}

The ${HF_TOKEN:?HF_TOKEN is not set} expansion makes the shell exit with an error if no Hugging Face token is set. If you don't have a Hugging Face token, you can use one of the ungated models by changing model_name in serve.py, as sketched below. Note that the Llama models not only require a Hugging Face token; you also need to request permission to use them (here for 3.1 and here for 3.3).
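
The swap is a one-line change in serve.py. The variable name comes from the sentence above, and the model ID here is purely illustrative; any ungated Hugging Face model ID works:

# Illustrative swap in serve.py: point model_name at an ungated model.
model_name = "Qwen/Qwen2.5-7B-Instruct"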

Understanding the example

  • The application code sets the required accelerator type with accelerator_type="L40S". This accelerator type is available on AWS; on other clouds, use an accelerator type like "A100" or "H100". See the list of supported accelerators for available options. Depending on the accelerator type you use, you will also need to select the appropriate instance types in service.yaml. The sketch after this list shows where this setting lives.
  • Ray Serve automatically autoscales the number of model replicas between min_replicas and max_replicas. Ray Serve adapts the number of replicas by monitoring queue sizes. For more information on configuring autoscaling, see the AutoscalingConfig documentation.
  • This example uses vLLM, and the Dockerfile defines the service’s dependencies. When you run anyscale service deploy, the build process adds these dependencies on top of an Anyscale-provided base image.
  • To configure vLLM, modify the engine_kwargs dictionary. See Ray documentation for the LLMConfig object.
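
The following is a minimal sketch of how these pieces fit together in serve.py, assuming Ray Serve's ray.serve.llm API. The model ID, replica bounds, and engine_kwargs values are illustrative stand-ins, not the exact values used in the example:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-70b",                          # name clients use to address the model
        model_source="meta-llama/Llama-3.1-70B-Instruct",  # Hugging Face model to download
    ),
    # Pin replicas to a GPU type; use "A100" or "H100" on other clouds.
    accelerator_type="L40S",
    deployment_config=dict(
        # Ray Serve scales the replica count within these bounds
        # based on observed queue sizes.
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    # Keyword arguments forwarded to the vLLM engine (illustrative values).
    engine_kwargs=dict(tensor_parallel_size=4, max_model_len=8192),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)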

Query the service

The anyscale service deploy command outputs a line that looks like the following:

curl -H "Authorization: Bearer <SERVICE_TOKEN>" <BASE_URL>

From the output, you can extract the service token and base URL. Open query.py and add them to the appropriate fields.

token = "<SERVICE_TOKEN>"
base_url = "<BASE_URL>"

Query the model

pip install openai
python query.py
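
For reference, a script like query.py is roughly the following sketch, built on the openai client. The placeholders mirror the fields above, and the model argument must match the model ID configured in serve.py:

from openai import OpenAI

# Fill these in from the anyscale service deploy output.
token = "<SERVICE_TOKEN>"
base_url = "<BASE_URL>"

# The service exposes an OpenAI-compatible API under /v1.
client = OpenAI(base_url=f"{base_url}/v1", api_key=token)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # must match the model ID in serve.py
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)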

View the service in the services tab of the Anyscale console.

Shutdown

Shut down your Anyscale service:

anyscale service terminate -n deploy-70b