Get started

Anyscale Private Endpoints offers a streamlined interface for developers to leverage state-of-the-art open source large language models (LLMs) to power AI applications. Deploying in a private cloud environment allows teams to meet their specific privacy, control, and customization requirements.

LLM applications that you build with Private Endpoints are backed by Ray and Anyscale and inherit robust production-ready features like zero-downtime upgrades, high availability, and enhanced observability. When you're ready for enterprise-level solutions and support, the transition to the expansive capabilities of the Anyscale Platform for machine learning workloads is seamless.

Set up your account

Sign up for Anyscale Private Endpoints to receive an invite code.
Create an account or sign in through the Anyscale Console.

Cloud prerequisites

To use Anyscale Private Endpoints, you must satisfy the following requirements:

Deploy an Anyscale Cloud.
Ensure that this cloud has sufficient quota to deploy your LLMs.

☁️Cloud quotas

For Anyscale Private Endpoints to serve your models, you must modify the default resource quotas set by your cloud service provider.

Note: The availability of instances can vary by region and zone, so confirm with your cloud service provider that your selection can accommodate your instance needs.

How to update AWS quotas

For Amazon EC2, follow these steps.

Navigate to the AWS Management Console and sign in.
Open the Services dropdown menu and, under the Management & Governance section, open Service Quotas.
Request a quota increase for the following instances. Quotas are region-specific, so update them in the region where your Anyscale Cloud runs.

Spot instance quotas

All G and VT Spot Instance Requests: Default is 0. Set to at least 512, which supports 8 G5.12xlarge and 8 G5.4xlarge spot instances.
All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests: Default is 5. Set to at least 512, which supports 16 M5.8xlarge spot instances.
All P4, P3, and P2 Spot Instance Requests: Default is 64. Set to at least 224, which supports 4 P3.8xlarge instances and 1 P4de.24xlarge instance.

Standard instance quotas

Running On-Demand G and VT instances: Default is 0. Set to at least 512, which supports 8 G5.12xlarge instances and 8 G5.4xlarge instances.
Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances: Default is 5. Set to at least 544, which supports 17 M5.8xlarge instances.
Running On-Demand P instances: Default is 64. Set to at least 224, which supports 4 P3.8xlarge instances and 1 P4de.24xlarge instance.
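The quota values above are vCPU counts, so each minimum is the sum of the vCPUs of the instances it has to cover. The following sketch reproduces that arithmetic so you can size a quota for a different fleet. The vCPU counts come from the published AWS instance specifications; the function name `required_vcpu_quota` and the fleet dictionaries are illustrative, not part of any Anyscale or AWS tool.

```python
from typing import Mapping

# vCPUs per instance type, taken from the AWS instance specifications.
VCPUS = {
    "g5.4xlarge": 16,
    "g5.12xlarge": 48,
    "m5.8xlarge": 32,
    "p3.8xlarge": 32,
    "p4de.24xlarge": 96,
}


def required_vcpu_quota(fleet: Mapping[str, int]) -> int:
    """Return the vCPU quota needed to run `fleet` (instance type -> count)."""
    return sum(VCPUS[name.lower()] * count for name, count in fleet.items())


# The fleets named in the quota lists above.
g_and_vt = required_vcpu_quota({"G5.12xlarge": 8, "G5.4xlarge": 8})
standard_spot = required_vcpu_quota({"M5.8xlarge": 16})
standard_on_demand = required_vcpu_quota({"M5.8xlarge": 17})
p_family = required_vcpu_quota({"P3.8xlarge": 4, "P4de.24xlarge": 1})
print(g_and_vt, standard_spot, standard_on_demand, p_family)
```

If you plan to deploy more or larger models than the defaults assume, adjust the counts and request the resulting value instead.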

How to update GCP quotas

Navigate to the Google Cloud Quotas page.
Filter the quotas by each quota metric listed below.
Select the checkbox for the region that matches your Anyscale Cloud.
Select EDIT QUOTAS to enter a new limit.

Quota Metric	Minimum Recommended Quota
`compute.googleapis.com/preemptible_cpus`	256 CPUs
`compute.googleapis.com/preemptible_nvidia_t4_gpus`	32 GPUs
`compute.googleapis.com/preemptible_nvidia_v100_gpus`	16 GPUs
`compute.googleapis.com/preemptible_nvidia_a100_gpus`	8 GPUs
`compute.googleapis.com/CPUs`	256 CPUs
`compute.googleapis.com/n2_cpus`	128 CPUs
`compute.googleapis.com/nvidia_t4_gpus`	32 GPUs
`compute.googleapis.com/nvidia_v100_gpus`	16 GPUs
`compute.googleapis.com/nvidia_a100_gpus`	8 GPUs
`Persistent Disk SSD (GB)`	1000 GB
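To see which limits still need a request, you can compare your current values against the table above. This is a minimal sketch that assumes you copy the current limits from the Quotas page into a dictionary by hand; the function name `quota_shortfalls` is hypothetical and the code doesn't call any Google Cloud API.

```python
from typing import Dict, Mapping

# Minimum recommended quotas, copied from the table above.
RECOMMENDED_MINIMUMS = {
    "compute.googleapis.com/preemptible_cpus": 256,
    "compute.googleapis.com/preemptible_nvidia_t4_gpus": 32,
    "compute.googleapis.com/preemptible_nvidia_v100_gpus": 16,
    "compute.googleapis.com/preemptible_nvidia_a100_gpus": 8,
    "compute.googleapis.com/CPUs": 256,
    "compute.googleapis.com/n2_cpus": 128,
    "compute.googleapis.com/nvidia_t4_gpus": 32,
    "compute.googleapis.com/nvidia_v100_gpus": 16,
    "compute.googleapis.com/nvidia_a100_gpus": 8,
    "Persistent Disk SSD (GB)": 1000,
}


def quota_shortfalls(current: Mapping[str, float]) -> Dict[str, float]:
    """Return metric -> missing amount for each quota below the minimum.

    Metrics that are absent from `current` are treated as a limit of 0.
    """
    shortfalls = {}
    for metric, minimum in RECOMMENDED_MINIMUMS.items():
        have = current.get(metric, 0)
        if have < minimum:
            shortfalls[metric] = minimum - have
    return shortfalls
```

An empty result means every listed quota already meets the recommended minimum for the selected region.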

Deploy an Anyscale Private Endpoint

Step 1: Create a new Endpoint

Click Endpoints server, and then click Create.

Step 2: Configure the deployment

Customize your settings:

Endpoint name: Fill in a unique name; the name is immutable after deployment.
Endpoint version: Select the latest.
Cloud name: Choose the cloud that you set up with the adjusted quotas to run your Private Endpoint in.
Select models to deploy: Choose which models you would like to deploy. You can update this selection after deployment. For more options, see the advanced configuration guide.
Click Create Endpoint.

Step 3: Set your API base and key

The status page displays your unique API base and key under Setup. How you set the environment variables that the cURL and Python examples read depends on your development platform or environment.

For a single project

This approach works across macOS, Windows, and Linux and allows you to specify environment variables for each project you're working on.

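Create a file named .env in your project's root directory with the following contents, replacing ANYSCALE_API_BASE and ANYSCALE_API_KEY with the values shown under Setup: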

OPENAI_BASE_URL=ANYSCALE_API_BASE
OPENAI_API_KEY=ANYSCALE_API_KEY

Add .env to your .gitignore file
Load environment variables

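Load into bash. The set -a option exports each variable so that programs started from the shell, such as Python, can read it.

set -a; source .env; set +a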

Use python-dotenv to load .env files in Python

from dotenv import load_dotenv
load_dotenv()

To make your API key permanent and accessible to all projects without needing to add it to your application code, use the following method for macOS:

Open the Terminal application.
Edit the shell profile. First, check which shell you're using:

echo $SHELL

For bash, the profile is .bash_profile:

nano ~/.bash_profile

For zsh, the profile is .zshrc:

nano ~/.zshrc

Add the environment variables, replacing ANYSCALE_API_BASE and ANYSCALE_API_KEY with the values shown under Setup:

export OPENAI_BASE_URL=ANYSCALE_API_BASE
export OPENAI_API_KEY=ANYSCALE_API_KEY

Save and exit.
Apply the changes. For bash:

source ~/.bash_profile

For zsh:

source ~/.zshrc

Verify the setup

echo $OPENAI_API_KEY

To make your API key permanent and accessible to all projects without needing to add it to your application code, use the following method for Windows:

Open the Command Prompt application
Set permanent environment variables, replacing ANYSCALE_API_BASE and ANYSCALE_API_KEY with the values shown under Setup:

setx OPENAI_BASE_URL ANYSCALE_API_BASE
setx OPENAI_API_KEY ANYSCALE_API_KEY

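Verify the setup. Open a new Command Prompt window first, because setx only applies to sessions started after you run it.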

echo %OPENAI_API_KEY%
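Whichever method you used, you can run the same check from Python before sending a request. This is a minimal sketch that uses only the standard library; the function name `find_env_problems` is hypothetical. It reports a variable that is missing or that still holds one of the ANYSCALE_API_BASE or ANYSCALE_API_KEY placeholders from the snippets above.

```python
import os
from typing import List, Mapping

REQUIRED = ("OPENAI_BASE_URL", "OPENAI_API_KEY")
# Placeholder strings from the setup snippets that you're meant to replace.
PLACEHOLDERS = {"ANYSCALE_API_BASE", "ANYSCALE_API_KEY"}


def find_env_problems(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a human-readable problem for each unusable variable."""
    problems = []
    for name in REQUIRED:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
        elif value in PLACEHOLDERS:
            problems.append(f"{name} still holds the placeholder {value}")
    return problems


# Print any problems found in the current environment.
for problem in find_env_problems():
    print(problem)
```

No output means both variables are present in the environment that your script or notebook sees.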

Step 4: Query the model

cURL is a command-line tool that developers commonly use for making HTTP requests. After you've set up your API base and key in your terminal or command prompt, send a sample request to the API with the following command:

curl -X 'POST' "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "meta-llama/Llama-2-70b-chat-hf",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the Australian open 2012 final, and how many sets were played?"}
  ]
}'

To make an API request using the OpenAI Python library, set your environment variables where your script or notebook is running. Then, run the following example:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_BASE_URL"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",  # Replace with your model name.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the Australian open 2012 final, and how many sets were played?"},
    ],
)

print(response.choices[0].message)
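If you can't install the openai package, the cURL request above can be reproduced with Python's standard library. The sketch below builds the same POST to the chat/completions path with the same two headers. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the response layout (a choices list whose entries carry a message with a content field) is assumed from the OpenAI-compatible format that the example above reads through response.choices[0].message.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str, api_key: str, model: str, messages: List[Dict[str, str]]
) -> urllib.request.Request:
    """Build the POST request that the cURL example sends."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        # Tolerate a trailing slash on the API base.
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Return the assistant text from a decoded chat completion response."""
    return response_body["choices"][0]["message"]["content"]
```

To send the request, pass the result of build_chat_request to urllib.request.urlopen, decode the response with json.load, and hand the decoded dictionary to extract_reply.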

Next steps

Check out the OpenAI Migration Guide to transition existing applications over from the OpenAI API to Anyscale Private Endpoints.
Further customize your model to meet your deployment, autoscaling, and text generation needs.
Use the observability tooling built into the Endpoints Server to monitor the health of deployed models and set up alerts for notable events.
