
Generate an embedding


Set up your environment

Create an API key on the Credentials page under your account.

Set the following environment variables.

export ANYSCALE_BASE_URL="https://api.endpoints.anyscale.com/v1"
export ANYSCALE_API_KEY="esecret_YOUR_API_KEY"

You can find more details about authentication here.
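If you prefer to keep credentials out of source code, you can read the variables you just exported when constructing the client. A minimal sketch in Python, assuming both environment variables from the previous step are set:

import os

import openai

# Build a client from the environment variables exported above.
client = openai.OpenAI(
    base_url=os.environ["ANYSCALE_BASE_URL"],
    api_key=os.environ["ANYSCALE_API_KEY"],
)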

Select a model


Anyscale supports embedding models such as thenlper/gte-large, which the example below queries.

Query an embedding model

tip

If you're starting a project from scratch, use the OpenAI Python SDK rather than raw cURL calls or hand-rolled Python HTTP requests.

Embedding models

The following is an example of querying the thenlper/gte-large embedding model.

import openai

# Create a client pointed at the Anyscale Endpoints API.
client = openai.OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",
    api_key="esecret_YOUR_API_KEY",
)

# Note: not all arguments are supported yet; unsupported arguments are ignored by the backend.
embedding = client.embeddings.create(
    model="thenlper/gte-large",
    input="Your text string goes here",
)
print(embedding.model_dump())

The output looks like the following:

{
    'data': [
        {
            'embedding': [...],
            'index': 0,
            'object': 'embedding'
        }
    ],
    'model': 'thenlper/gte-large',
    'object': 'list',
    'usage': {
        'prompt_tokens': 7,
        'total_tokens': 7
    },
    'id': 'thenlper/gte-large-UEpQEaduAoaC6rq5n1yxkYNalVukLBhMzkG7IV_GPgU',
    'created': 1701325873
}
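Each entry in data carries the embedding vector itself, accessible as embedding.data[0].embedding. As a usage sketch, the following embeds two strings in one request (the OpenAI API accepts a list of strings for input; assuming the Anyscale backend does too) and compares them with cosine similarity. The cosine helper is illustrative, not part of the SDK:

import math

# Embed two strings in a single request; client is the one constructed earlier.
response = client.embeddings.create(
    model="thenlper/gte-large",
    input=["How do I reset my password?", "password reset instructions"],
)
a = response.data[0].embedding
b = response.data[1].embedding

def cosine(u, v):
    # Cosine similarity: higher means the texts are semantically closer.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine(a, b))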

Rate limiting

Anyscale Endpoints rate limits work differently from comparable platforms: the limits are based on the number of concurrent requests in flight, not on tokens or requests per second. In other words, you aren't limited by how many requests you send in total, but by how many you have in flight at once.

The current default limit is 30 concurrent requests. Reach out to endpoints-help@anyscale.com if you have a use case that needs more.
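Because the limit counts in-flight requests rather than requests per second, a simple way to stay under it is to cap concurrency on the client side. A minimal sketch using a thread pool sized to the default limit of 30; the texts list and embed helper are illustrative, and client is the one constructed earlier:

from concurrent.futures import ThreadPoolExecutor

texts = [f"document {i}" for i in range(100)]  # illustrative inputs

def embed(text):
    return client.embeddings.create(
        model="thenlper/gte-large",
        input=text,
    ).data[0].embedding

# At most 30 requests are in flight at any time, matching the default limit.
with ThreadPoolExecutor(max_workers=30) as pool:
    vectors = list(pool.map(embed, texts))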