Version: 1.0.0

Get started

Check your docs version

Anyscale is rolling out a new design. If you have preview access to the enhanced experience, use the latest version of the docs and see the migration guide for transitioning.

Anyscale Private Endpoints offers a streamlined interface for developers to leverage state-of-the-art open source large language models (LLMs) to power AI applications. Deploying in a private cloud environment allows teams to meet their specific privacy, control, and customization requirements.

LLM applications that you build with Private Endpoints are backed by Ray and Anyscale and inherit production-ready features such as zero-downtime upgrades, high availability, and enhanced observability. When you're ready for enterprise-level solutions and support, you can transition to the broader capabilities of the Anyscale Platform for machine learning workloads.

Set up your account

  1. Sign up for Anyscale Private Endpoints to receive an invite code.
  2. Create an account or sign in through the Anyscale Console.

Cloud prerequisites

To use Anyscale Private Endpoints, you must satisfy the following requirements:

  1. Deploy an Anyscale Cloud.
  2. Ensure that this cloud has sufficient quota to deploy your LLMs.

Cloud quotas

For Anyscale Private Endpoints to serve your models, you must modify the default resource quotas set by your cloud service provider.

Note: The availability of instances can vary by region and zone, so confirm with your cloud service provider that your selection can accommodate your instance needs.

How to update AWS quotas

For Amazon EC2, follow these steps.

  1. Navigate to the AWS Management Console and sign in.
  2. Open the Services dropdown menu and, under the Management & Governance section, open Service Quotas.
  3. Request a quota increase for the following instances. Quotas are region-specific, so update them in the region where your Anyscale Cloud runs.

Spot instance quotas

  • All G and VT Spot Instance Requests: Default is 0. Set to at least 512, which supports 8 G5.12xlarge and 8 G5.4xlarge spot instances.
  • All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests: Default is 5. Set to at least 512, which supports 16 M5.8xlarge spot instances.
  • All P4, P3, and P2 Spot Instance Requests: Default is 64. Set to at least 224, which supports 4 P3.8xlarge spot instances and 1 P4de.24xlarge spot instance.

Standard instance quotas

  • Running On-Demand G and VT instances: Default is 0. Set to at least 512, which supports 8 G5.12xlarge instances and 8 G5.4xlarge instances.
  • Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances: Default is 5. Set to at least 544, which supports 17 M5.8xlarge instances.
  • Running On-Demand P instances: Default is 64. Set to at least 224, which supports 4 P3.8xlarge instances and 1 P4de.24xlarge instance.
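The recommended values above match vCPU totals, because AWS counts these instance quotas in vCPUs rather than in instances. The sketch below reproduces that arithmetic. The per-instance vCPU counts come from AWS's published instance specifications, not from this guide, and the helper name `required_vcpus` is made up for illustration.

```python
# vCPUs per instance type, per AWS instance specifications (an assumption
# brought in from outside this guide).
VCPUS = {
    "g5.4xlarge": 16,
    "g5.12xlarge": 48,
    "m5.8xlarge": 32,
    "p3.8xlarge": 32,
    "p4de.24xlarge": 96,
}


def required_vcpus(fleet):
    """Return the vCPU quota needed for `fleet`, a mapping of instance type to count."""
    return sum(VCPUS[instance_type] * count for instance_type, count in fleet.items())


# Fleets named in the quota lists above.
g_fleet = {"g5.12xlarge": 8, "g5.4xlarge": 8}
standard_spot_fleet = {"m5.8xlarge": 16}
standard_on_demand_fleet = {"m5.8xlarge": 17}
p_fleet = {"p3.8xlarge": 4, "p4de.24xlarge": 1}

for name, fleet in [
    ("G and VT", g_fleet),
    ("Standard (spot)", standard_spot_fleet),
    ("Standard (on-demand)", standard_on_demand_fleet),
    ("P", p_fleet),
]:
    print(name, required_vcpus(fleet))
```

To size a quota for a different fleet, pass your own mapping of instance types to counts, adding any missing instance types to `VCPUS` first.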

How to update GCP quotas

  1. Navigate to the Google Cloud Quotas page.
  2. Filter the quotas by each quota metric listed below.
  3. Select the checkbox for the region that matches your Anyscale Cloud.
  4. Select EDIT QUOTAS to enter a new limit.

Quota Metric                Minimum Recommended Quota
Persistent Disk SSD (GB)    1000 GB
[CPU and GPU quota rows: metric names and values not recoverable]

Deploy an Anyscale Private Endpoint

Step 1: Create a new Endpoint

Click Endpoints server, and then click Create.

Step 2: Configure the deployment

Customize your settings:

  1. Endpoint name: Fill in a unique name; the name is immutable after deployment.
  2. Endpoint version: Select the latest.
  3. Cloud name: Choose the cloud that you set up with the adjusted quotas to run your Private Endpoint in.
  4. Select models to deploy: Choose the models that you want to deploy. You can update this selection after deployment. See the advanced configuration guide for more options.
  5. Click Create Endpoint.

Step 3: Set your API base and key

The status page displays your unique API base and key under Setup. How you set them as environment variables for the cURL command depends on your development platform or environment.

Using a .env file works across macOS, Windows, and Linux and lets you specify environment variables for each project you're working on.

  1. Create a file named .env in your project's root directory.

    The environment variables are named OPENAI_BASE_URL and OPENAI_API_KEY to ensure seamless compatibility with existing applications written with OpenAI APIs, but set them to your Anyscale API base and key.

    Add the following lines, replacing 'ANYSCALE_API_BASE' and 'ANYSCALE_API_KEY' with your API base and key copied from the Setup section on the About page for an endpoint:

      OPENAI_BASE_URL=ANYSCALE_API_BASE
      OPENAI_API_KEY=ANYSCALE_API_KEY

  2. Add .env to your .gitignore file.

    Protect your API key and sensitive information by ensuring that you never accidentally commit this file to a Git repository.

  3. Load environment variables.

    Use one of the following two options to load the environment variables:

    • Load into bash. Run the following command, which loads all the variables into the current shell session so that commands run in that session can reference them. To make them visible to child processes such as scripts as well, run set -a before this command so that the variables are exported:
      source .env
    • Use python-dotenv to load .env files in Python. With this library, you can use these lines of code in a Python program to load environment variables:
      from dotenv import load_dotenv
      load_dotenv()
Step 4: Query the model

cURL is a command-line tool that developers commonly use for making HTTP requests. After you've set up your API base and key in your terminal or command prompt, send a sample request to the API with the following command:

curl -X 'POST' "$OPENAI_BASE_URL/chat/completions" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-2-70b-chat-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the Australian Open 2012 final, and how many sets were played?"}
]
}'
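The same request can be sketched in Python with only the standard library. This is an illustrative equivalent of the cURL command, not an official client: the function names `build_chat_request` and `send_chat_request` are hypothetical, and the response is returned as raw parsed JSON because this guide does not describe its structure.

```python
import json
import os
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST request to the chat completions route."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (performs a real network request, so it is left commented out):
# request = build_chat_request(
#     os.environ["OPENAI_BASE_URL"],
#     os.environ["OPENAI_API_KEY"],
#     "meta-llama/Llama-2-70b-chat-hf",
#     [
#         {"role": "system", "content": "You are a helpful assistant."},
#         {"role": "user", "content": "Who won the Australian Open 2012 final?"},
#     ],
# )
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without contacting the endpoint.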

Next steps

  • Check out the OpenAI Migration Guide to transition existing applications over from the OpenAI API to Anyscale Private Endpoints.
  • Further customize your model to meet your deployment, autoscaling, and text generation needs.
  • Use the observability tooling built into the Endpoints Server to monitor the health of deployed models and set up alerts for notable events.