Get started

Anyscale Private Endpoints offers a streamlined interface for developers to leverage state-of-the-art open source large language models (LLMs) to power AI applications. Deploying in a private cloud environment allows teams to meet their specific privacy, control, and customization requirements.

LLM applications that you build with Private Endpoints are backed by Ray and Anyscale and inherit robust production-ready features like zero downtime upgrades, high availability, and enhanced observability. When you're ready for enterprise-level solutions and support, the transition to the expansive capabilities of the Anyscale Platform for machine learning workloads is seamless.

Set up your account

  1. Sign up for Anyscale Private Endpoints to receive an invite code.
  2. Create an account or sign in through the Anyscale Console.

Cloud prerequisites

To use Anyscale Private Endpoints, you must satisfy the following requirements:

  1. Deploy your Anyscale Cloud.
  2. Ensure that your cloud has sufficient quota to deploy your LLMs.
Note: Availability of instance types varies by region and availability zone. Before adjusting your cloud quota, confirm with your cloud service provider that your selected region and zone can accommodate your instance needs.

How to update cloud quotas
Cloud service providers set quotas to prevent over-consumption and ensure fair distribution. For Anyscale Private Endpoints to run smoothly, raise your quotas enough to serve LLMs. To keep costs down, Anyscale's Compute Configuration prioritizes spot instances and falls back to on-demand instances, so adjust the quotas for both types.

AWS (EC2)
For Amazon EC2, follow these steps.
  1. Navigate to the AWS Management Console and sign in.
  2. Open the Services dropdown menu and, under the Management & Governance section, open Service Quotas.
  3. Request a quota increase for the following instances. Quotas are region-specific, so update them in the region where you deploy.

Spot instance quotas

  • All G and VT Spot Instance Requests: Default is 0. Set to at least 512, which supports 8 G5.12xlarge and 8 G5.4xlarge spot instances.
  • All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests: Default is 5. Set to at least 512, which supports 16 M5.8xlarge spot instances.
  • All P4, P3, and P2 Spot Instance Requests: Default is 64. Set to at least 224, which supports 4 P3.8xlarge instances and 1 P4de.24xlarge instance.

Standard instance quotas

  • Running On-Demand G and VT instances: Default is 0. Set to at least 512, which supports 8 G5.12xlarge instances and 8 G5.4xlarge instances.
  • Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances: Default is 5. Set to at least 544, which supports 17 M5.8xlarge instances.
  • Running On-Demand P instances: Default is 64. Set to at least 224, which supports 4 P3.8xlarge instances and 1 P4de.24xlarge instance.
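The recommended values above line up with EC2's vCPU-based quotas. As a rough check, the sketch below reproduces the arithmetic. It assumes the published vCPU counts for each instance type, and `required_vcpus` is a helper name introduced here purely for illustration.

```python
# vCPUs per instance type (assumed from the published EC2 instance specifications).
VCPUS = {
    "g5.12xlarge": 48,
    "g5.4xlarge": 16,
    "m5.8xlarge": 32,
    "p3.8xlarge": 32,
    "p4de.24xlarge": 96,
}


def required_vcpus(fleet):
    """Return the total vCPUs needed for a mapping of instance type -> count."""
    return sum(VCPUS[instance_type] * count for instance_type, count in fleet.items())


# Fleets named in the quota recommendations above.
g_and_vt = required_vcpus({"g5.12xlarge": 8, "g5.4xlarge": 8})
standard_spot = required_vcpus({"m5.8xlarge": 16})
standard_on_demand = required_vcpus({"m5.8xlarge": 17})
p_family = required_vcpus({"p3.8xlarge": 4, "p4de.24xlarge": 1})

print(g_and_vt, standard_spot, standard_on_demand, p_family)  # 512 512 544 224
```

If you plan a different mix of instances, adjusting the fleet dictionaries gives a comparable lower bound for each quota.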

GCP (GCE)
The Google Cloud quotas you need to change belong to the Compute Engine API. Searching for the following metrics shows the quota for each region; also review the All Regions quota, which may need adjusting as well.
  1. Go to Google Cloud Console and sign in.
  2. Access the IAM & Admin section and select "Quotas."
  3. Filter by the service Compute Engine API from the dropdown.
  4. Select the quotas you want to increase, click Edit Quotas at the top of the page, and adjust the numbers according to the recommendations in the following list.

Preemptible instance quotas

  • compute.googleapis.com/preemptible_cpus - Set this to at least 256
  • compute.googleapis.com/preemptible_nvidia_t4_gpus - Set this to at least 32

  • Optional

    • compute.googleapis.com/preemptible_nvidia_v100_gpus - Set this to at least 16
    • compute.googleapis.com/preemptible_nvidia_a100_gpus - Set this to at least 2

Standard instance quotas

  • compute.googleapis.com/CPUs - Set this to at least 256
  • compute.googleapis.com/n2_cpus - Set this to at least 128
  • compute.googleapis.com/nvidia_t4_gpus - Set this to at least 32

  • Optional

    • compute.googleapis.com/nvidia_v100_gpus - Set this to at least 16
    • compute.googleapis.com/nvidia_a100_gpus - Set this to at least 2
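One way to see which quotas still need an increase is to compare your current limits against the required (non-optional) minimums listed above. The following is a minimal sketch: `find_shortfalls` is a hypothetical helper, and the `current_limits` values are made up for illustration. In practice you would copy the real numbers from the Quotas page.

```python
# Required minimums taken from the lists above (optional GPU quotas omitted).
RECOMMENDED_MINIMUMS = {
    "compute.googleapis.com/preemptible_cpus": 256,
    "compute.googleapis.com/preemptible_nvidia_t4_gpus": 32,
    "compute.googleapis.com/CPUs": 256,
    "compute.googleapis.com/n2_cpus": 128,
    "compute.googleapis.com/nvidia_t4_gpus": 32,
}


def find_shortfalls(current_limits):
    """Return {metric: amount still needed} for every quota below its minimum."""
    shortfalls = {}
    for metric, minimum in RECOMMENDED_MINIMUMS.items():
        current = current_limits.get(metric, 0)  # treat a missing quota as 0
        if current < minimum:
            shortfalls[metric] = minimum - current
    return shortfalls


# Hypothetical limits for one region, for illustration only.
current_limits = {
    "compute.googleapis.com/preemptible_cpus": 256,
    "compute.googleapis.com/preemptible_nvidia_t4_gpus": 8,
    "compute.googleapis.com/CPUs": 72,
    "compute.googleapis.com/n2_cpus": 128,
    "compute.googleapis.com/nvidia_t4_gpus": 32,
}

for metric, gap in find_shortfalls(current_limits).items():
    print(f"{metric}: increase by at least {gap}")
```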

Deploy an Anyscale Private Endpoint

Step 1: Create a new Endpoint

Click Endpoints server, and then click Create.

Step 2: Configure the deployment

Customize your settings:

  1. Endpoint name: Fill in a unique name; the name is immutable after deployment.
  2. Endpoint version: Select the latest.
  3. Cloud name: Choose the cloud, configured with the adjusted quotas, in which to run your Private Endpoint.
  4. Select models to deploy: Choose which models you would like to deploy. You can update this selection after deployment. See the advanced configuration guide for more options.
  5. Click Create Endpoint.

Step 3: Set your API base and key

The status page displays your unique API base and key under Setup. How you set the environment variables that the cURL command uses depends on your development platform or environment.

Storing them in a .env file works across macOS, Windows, and Linux and lets you specify environment variables for each project you're working on.

  1. Create a file named .env in your project's root directory.

    The environment variable names use the OPENAI_ prefix to ensure seamless compatibility with existing applications written with OpenAI APIs, but the values should be your Anyscale API base and key.

    Add the following lines, replacing 'ANYSCALE_API_BASE' and 'ANYSCALE_API_KEY' with your API base and key copied from the Setup section on the About page for an endpoint:

    OPENAI_BASE_URL=ANYSCALE_API_BASE
    OPENAI_API_KEY=ANYSCALE_API_KEY
  2. Add .env to your .gitignore file.

    Protect your API key and sensitive information by ensuring that you never accidentally commit this file to a Git repository.
  3. Load environment variables.

    Use one of the following two options to load the environment variables:
    1. Load into bash

      Run the following command, which loads and exports all the variables into the current session, allowing scripts and commands run in that session to access them (a plain `source .env` sets the variables without exporting them to child processes):
      set -a; source .env; set +a
    2. Use python-dotenv to load .env files in Python

      With this library, you can use these lines of code in a Python program to load environment variables:
      from dotenv import load_dotenv
      load_dotenv()
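Before querying the model, it can help to confirm that both variables are actually set and that the placeholder text was replaced. The following is a minimal sketch using only the standard library. `missing_settings` is a hypothetical helper, not part of python-dotenv or any Anyscale tooling.

```python
import os

REQUIRED = ("OPENAI_BASE_URL", "OPENAI_API_KEY")
# Placeholder strings from the .env template that must be replaced with real values.
PLACEHOLDERS = {"ANYSCALE_API_BASE", "ANYSCALE_API_KEY"}


def missing_settings(env=None):
    """Return the required variable names that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    return [
        name
        for name in REQUIRED
        if not env.get(name) or env[name] in PLACEHOLDERS
    ]


problems = missing_settings()
if problems:
    print("Set these variables before continuing:", ", ".join(problems))
else:
    print("API base and key are configured.")
```

Passing a dictionary instead of relying on `os.environ` makes the check easy to exercise without touching your real environment.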

Step 4: Query the model

cURL is a command-line tool that developers commonly use for making HTTP requests. After you've set up your API base and key in your terminal or command prompt, send a sample request to the API with the following command:

curl -X 'POST' "$OPENAI_BASE_URL/chat/completions" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the Australian open 2012 final and how many sets were played?"}
]
}'
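The same request can be sketched in Python with only the standard library. This mirrors the cURL command above rather than any official client: `build_chat_request`, `send`, and `main` are names introduced here for illustration, and reading the reply from `choices[0]["message"]["content"]` assumes the endpoint returns an OpenAI-compatible response body.

```python
import json
import os
import urllib.request


def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST request to the chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main():
    # Reads the same environment variables that the cURL command uses.
    request = build_chat_request(
        os.environ["OPENAI_BASE_URL"],
        os.environ["OPENAI_API_KEY"],
        "meta-llama/Llama-2-7b-chat-hf",
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the Australian open 2012 final and how many sets were played?"},
        ],
    )
    reply = send(request)
    # Assumes an OpenAI-compatible response shape.
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` once the environment variables are loaded. Keeping request construction separate from sending makes it possible to inspect the URL, headers, and payload without making a network call.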

Next steps

  • Check out the OpenAI Migration Guide to transition existing applications over from the OpenAI API to Anyscale Private Endpoints.
  • Further customize your model to meet your deployment, autoscaling, and text generation needs.
  • Use the observability tooling built into the Endpoints Server to monitor the health of deployed models and set up alerts for notable events.