
Frequently Asked Questions

You can find information about Services and Ray Serve below. Check out the Ray docs and the Ray forum for more Serve information.

  1. Ray Serve and Anyscale Services
  2. Deploy an Anyscale Service
  3. Autoscaling
  4. Miscellaneous

Ray Serve and Anyscale Services

What is Ray Serve?

Ray Serve is a scalable model serving library for building online inference applications. Ray Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. This flexibility lets you bring any model optimization, such as TensorRT, vLLM, or DeepSpeed. Ray Serve scales across heterogeneous machines and provides flexible scheduling support to maximize hardware utilization.

Ray Serve is particularly well suited for model composition and many-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic, all in Python code. Some of the most valuable features Ray Serve has to offer include:

  • Model Composition: Compose many individual models and business logic that are often required for building AI applications. Each of these components can be scaled independently.
  • Model Multiplexing: Enable efficient utilization of cloud resources by multiplexing models inside a pool of deployment replicas. This feature is useful in cases where you have many models with similar shape, but different weights, that you invoke sparsely.
  • Multiple Applications: Configure separate applications with their own deployment replicas that can be deployed or upgraded separately. Multi-app support is particularly powerful for use cases with many independent models deployed within one cluster to maximize hardware utilization. You can easily add, delete, or update models in one application without affecting other applications.
  • Autoscaling: Ray Serve supports dynamically scaling the resources for a model up and down by adjusting the number of replicas.
  • Resource Allocation: Ray Serve supports a flexible resource allocation model, including fractional GPUs, that enables you to serve models on limited hardware resources.
  • Dynamic Request Batching: Improve throughput without sacrificing latency goals.
  • LLM Features: To meet the needs of more complex AI models, Ray Serve adds features such as streaming responses, which makes it well suited to deploying generative AI and LLM applications.
  • Runtime Variables: When writing an application, some parameters often change between development and production, for example a path to trained model weights that must be updated when a new model is released. Runtime variables and arguments let you make these changes without changing code.

Find more information in the Ray Serve Documentation.

What is an Anyscale Production Service?

Anyscale Production Services are made to support Ray Serve Application deployments. Anyscale Services offer additional benefits on top of Ray Serve, including high availability and zero-downtime upgrades.

Deploy an Anyscale Service

Can I use any Ray version to deploy an Anyscale Service?

No, Services require Ray version 2.3 or later.

What permissions are required for Services?

Anyscale follows the principle of least privilege for permissions. See the scope of permissions here: IAM Permissions

What cloud resources are set up when I deploy Services?

A Redis in-memory store and load balancing resources are required to support Services. Learn more about the architecture here: Services Architecture

These resources use the Service name when deployed and are tagged in their metadata with the Service name for easy lookup from the Cloud Provider console.

Is the Load Balancer public or private?

Anyscale supports both public and private load balancers. The type of Load Balancer depends on the networking settings for the cloud: when using a cloud with private networking, the Load Balancer is private. You can learn more about how to configure subnets in your cloud in the Cloud Deployment doc.

How much do Anyscale Production Services cost?

Anyscale Production Services are charged based on runtime usage. Services also support scaling down to 0 to minimize costs.

You can find information on cloud costs related to Services below:



How many Anyscale Services can I deploy in a single cloud?

There is a quota of 20 running Services per cloud. Each Service may have many deployments and can scale to more than 2,000 nodes. If you need to increase your quota, please reach out to Anyscale support.

Are there naming restrictions for an Anyscale Service?

Anyscale service names come with certain limitations:

  • Names are limited to using alphanumeric characters, underscores, and dashes.
  • The maximum character count for a name is 57 characters.
  • Anyscale service names must be unique within a project.
    • Note that multiple services with the same name can coexist within an organization.
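The character and length rules above can be checked locally before deploying. The following is a minimal sketch, not an Anyscale API: the function name is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that an empty name is invalid. Uniqueness within a project isn't checked here, because that requires querying Anyscale.

```python
import re

# Limits taken from the list above. The ASCII-only interpretation of
# "alphanumeric" is an assumption of this sketch.
MAX_NAME_LENGTH = 57
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_service_name(name: str) -> bool:
    """Return True if `name` uses only allowed characters and fits the length limit."""
    return len(name) <= MAX_NAME_LENGTH and _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["my_service", "my service", "a" * 58]:
        print(repr(candidate[:20]), is_valid_service_name(candidate))
```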

What does the development process look like for Anyscale Services?

The development flow with Anyscale is both efficient and straightforward. Developers may use Anyscale Workspaces and their chosen IDE to build their Serve app. Workspaces provide a familiar remote dev-box experience to rapidly iterate, debug, and test their Ray Serve application while developing.

Once ready, the applications may be deployed to a Production Service via a single command. Anyscale packages the necessary dependencies, generates the configuration from the Serve app, and deploys it to an auto-scaling cluster preconfigured with monitoring.

Users may even clone a running Service to ensure perfect reproducibility in the event of issues or errors.

For a more in-depth walk-through on the development process, please refer to the documentation here: Development Workflow with Services

I am an OSS Ray Serve user, how do I migrate to Anyscale?

Moving to Anyscale production Services is simple. OSS users can use serve build to generate their Ray Serve configuration, then add that configuration to an Anyscale production Service YAML under the ray_serve_config field. An example of how to do this is shown below.

Ray Serve configuration generated by serve build:

import_path: serve_hello:entrypoint

runtime_env: {
  "working_dir": ".",
  "pip": ["requests", "torch>=1.4.0"]
}

port: 8000

deployments:
- name: HelloWorld
  route_prefix: /

Equivalent Anyscale production Service YAML:

name: my_service
ray_serve_config:
  applications:
  - name: my_service
    import_path: serve_hello:entrypoint
    runtime_env:
      working_dir: .
      pip: [requests, torch>=1.4.0]

Finally, export your Anyscale credentials and run anyscale service rollout -f my_production_service.yaml to deploy the application as an Anyscale Service. For more information on deploying an Anyscale Service, refer to the Getting Started guide.

Refer to Configure Anyscale Service to learn more about configuring a production Service YAML.

How do I configure dependencies?

You can configure dependencies for your Service in three different ways:

  1. The most common approach is by adding your dependencies in a cluster environment
  2. Dependencies can also be defined in a custom docker image using bring your own docker environment
  3. Finally, you have the option to define dependencies through a runtime environment

How do I specify compute resources?

You can find more information on how to define your cluster's compute resources in the Compute Configs documentation. You can also refer to Configure Anyscale Service to learn how to define the compute resources in the Service YAML.


Anyscale also supports some ML serving specific accelerators like AWS Inferentia. These are (of course) only available if your Anyscale Cloud is running on top of the relevant Cloud provider.

For a complete list of available Instance Types please see here

As of Nov 21, 2023, Inferentia Instance Types are in Private Preview. We've also published two Workspace Templates: one for Stable Diffusion XL and another for Llama2-7b.

How do I optimize performance?

Please refer to the Performance Tuning documentation for suggested techniques on optimizing your Ray application.

Can I run my Service on spot instances?

Users may elect to run Anyscale Production Services on Spot, On-Demand, or Spot Fallback to On-Demand. However, running on spot incurs some risk of service interruption due to spot eviction.

How can I monitor resources used for Anyscale Production Services?

Users may use resource logs and Ray Dashboard to monitor resources, including node count and utilization.

If desired, users may also choose to navigate to their cloud provider console and view resources. Resources created by an Anyscale Production Service include the Service name when deployed and include metadata on cloud resources for easy identification.


What are Zero-Downtime Upgrades?

Zero-downtime upgrades enable our customers to perform upgrades and deployments without any service interruption, ensuring a consistent and reliable user experience. Zero-downtime upgrades can be accomplished using automatic rollouts, manual rollouts, and in-place upgrades.

See more information in the Upgrade a Service documentation.

How do Canary Deployments work? How long does it take to deploy a new version?

Canary deployments allow you to test the latest application versions in the production environment with a small portion of users, reducing the risk of a full-scale rollout. Traffic is routed between two clusters, the primary and canary, during rollout to ensure a smooth experience and transition.

Rollouts on Anyscale production Services perform a canary rollout by default. The Serve controller will first start the new cluster, and increment traffic every 10 seconds. On AWS, traffic is shifted in increments in the following order: 0%, 10%, 25%, 50%, 100%. A full deployment rollout may take up to 1-2 minutes. On GCP, traffic is shifted in increments in the following order: 0%, 10%, 50%, 100%. A full deployment rollout may take up to 8-10 minutes.
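To make the traffic steps concrete, here is a small illustrative sketch that turns the percentages listed above into primary/canary splits. The names are hypothetical and this isn't how Anyscale implements rollouts; it only restates the documented steps as data.

```python
# Canary traffic share (in percent) at each rollout step, as listed above.
CANARY_STEPS = {
    "aws": [0, 10, 25, 50, 100],
    "gcp": [0, 10, 50, 100],
}


def traffic_splits(provider: str) -> list:
    """Return (primary_percent, canary_percent) pairs for each rollout step."""
    return [(100 - canary, canary) for canary in CANARY_STEPS[provider]]


if __name__ == "__main__":
    for provider in CANARY_STEPS:
        print(provider, traffic_splits(provider))
```

At every step the two shares sum to 100%, and the rollout is complete once the canary cluster receives all traffic.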

Users may also manually perform a canary rollout which allows the user to specify a target value (canary percent) of traffic directed to the new version.

Do I have to do a canary rollout to upgrade my Service?

No - users can take advantage of the in-place upgrade feature. This is particularly helpful during development where you do not want to wait for start up time of a new cluster. In-place upgrades will not start a separate cluster but rather upgrade the Service in the existing cluster.

Note that in-place upgrades are intended for application code changes, not infrastructure changes. Users can only modify the ray_serve_config field of their Service YAML.
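One way to catch an incompatible change before attempting an in-place upgrade is to compare the old and new Service configurations locally. The sketch below assumes both configurations have already been loaded into Python dictionaries; the helper names and the compute_config field in the usage example are hypothetical, and only ray_serve_config comes from the text above.

```python
_MISSING = object()  # sentinel so that a missing key differs from a key set to None


def changed_top_level_fields(old: dict, new: dict) -> set:
    """Return the top-level keys whose values differ between two configs."""
    return {
        key
        for key in old.keys() | new.keys()
        if old.get(key, _MISSING) != new.get(key, _MISSING)
    }


def is_in_place_compatible(old: dict, new: dict) -> bool:
    """True if the only top-level field that changed (if any) is ray_serve_config."""
    return changed_top_level_fields(old, new) <= {"ray_serve_config"}


if __name__ == "__main__":
    current = {
        "name": "my_service",
        "compute_config": "small-cpu",  # hypothetical field, for illustration only
        "ray_serve_config": {"applications": [{"import_path": "serve_hello:entrypoint"}]},
    }
    code_change = {
        **current,
        "ray_serve_config": {"applications": [{"import_path": "serve_hello_v2:entrypoint"}]},
    }
    infra_change = {**current, "compute_config": "large-gpu"}
    print(is_in_place_compatible(current, code_change))   # application-only change
    print(is_in_place_compatible(current, infra_change))  # infrastructure change
```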

Learn more here: Upgrade a Service

When deploying a new version can I perform a rollback?

Yes, users may perform a rollback immediately. A rollback may only be performed while the original cluster is alive and the deployment is in progress.

What is Model Multiplexing, and how does it improve cloud resource usage?

Model multiplexing allows you to host multiple AI models inside a pool of deployment replicas. This optimizes cloud resources and increases the throughput of model serving.
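The idea can be illustrated without any serving framework. The following conceptual sketch is not the Ray Serve API: the class and its parameters are hypothetical, and least-recently-used eviction is an assumption of the sketch. It shows a single replica that keeps a bounded number of models loaded and swaps them as requests for different model IDs arrive.

```python
from collections import OrderedDict
from typing import Callable


class MultiplexedReplica:
    """A replica that keeps at most `max_models` models loaded at a time."""

    def __init__(self, load_model: Callable[[str], object], max_models: int = 2):
        self._load_model = load_model
        self._max_models = max_models
        self._models = OrderedDict()  # model_id -> model, oldest use first

    def get_model(self, model_id: str) -> object:
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            self._models.popitem(last=False)  # evict the least recently used model
        model = self._load_model(model_id)
        self._models[model_id] = model
        return model

    def loaded_model_ids(self) -> list:
        return list(self._models)


if __name__ == "__main__":
    # The loader is a stand-in; a real one would read weights from storage.
    replica = MultiplexedReplica(lambda model_id: f"weights-for-{model_id}", max_models=2)
    for requested in ["a", "b", "a", "c"]:
        replica.get_model(requested)
    print(replica.loaded_model_ids())
```

Because sparsely used models share the same replicas instead of each holding dedicated hardware, fewer replicas sit idle.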

Learn more here: Model Multiplexing

What are Multi-app Services, and how are they different from Multiplexing?

Multi-app empowers developers to deploy and manage multiple applications in the same cluster. Multi-app is useful when you need to deploy multiple models as standalone deployments and upgrade them independently. In multi-app deployments, models may use different runtime environments. Multiplexing is a technique to dynamically serve many models on the same deployment. In model multiplexing, each model must use the same runtime environment/interface.

Learn more here: Multiple Serve Applications

Can I deploy a frontend with Anyscale Production Services? Or is it only for backend models and endpoints?

Absolutely. Users can use CloudFront and Gradio/Streamlit to build frontends for their applications.

When using Gradio, users need to pass a function to GradioIngress instead of passing a Gradio Interface object. You can find more information on deploying Gradio with Ray Serve in Scaling your Gradio app with Ray Serve in the Ray docs.

The documentation provides more examples. Check out Anyscale Aviary which is an example of an Anyscale Production Service with a Gradio frontend.


How do I integrate FastAPI into my Service?

You can find more information in FastAPI HTTP Deployments in the Ray Serve docs. Once your production Service is deployed, you can access the automatic OpenAPI documentation from the UI.

<service_name>-<5 digit random hex>.<cloud_id>

Why am I experiencing timeouts?

There are a few avenues to explore when investigating the probable root cause of the observed timeouts:

  • Load Balancer Idle Connection timeouts (AWS): If your requests take more than 60 seconds (the default AWS ELB idle connection timeout) to handle, you might observe 503 HTTP response codes returned when calling the Anyscale Service. To increase this default timeout, please reach out to Anyscale Support and we’ll adjust it to a higher value for your organization.

  • Load Balancer timeouts (GCP): The timeout for GCP is set to 600 seconds by default for request handling. Note that this timeout is different from the idle connection timeout: it sets a hard deadline for request handling and applies whether or not your application is sending a response back, whereas an idle timeout only occurs on connections that haven't had any response sent back.
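The difference between the two timeout types can be sketched as a small check. This is an illustrative model only, using the default values quoted above; the function name and its parameters are hypothetical, and real load balancer behavior has more nuance than this.

```python
AWS_IDLE_TIMEOUT_S = 60      # default AWS ELB idle connection timeout (from the text above)
GCP_REQUEST_TIMEOUT_S = 600  # default GCP request-handling timeout (from the text above)


def exceeds_load_balancer_timeout(
    provider: str, total_s: float, longest_silence_s: float
) -> bool:
    """Estimate whether a request would hit the default load balancer timeout.

    total_s: total time the request takes to complete.
    longest_silence_s: longest stretch during which no response data is sent.
    """
    if provider == "aws":
        # Idle timeout: only stretches without any data on the connection count.
        return longest_silence_s > AWS_IDLE_TIMEOUT_S
    if provider == "gcp":
        # Hard deadline: total duration counts, even if data is streaming.
        return total_s > GCP_REQUEST_TIMEOUT_S
    raise ValueError(f"unknown provider: {provider!r}")


if __name__ == "__main__":
    # A 15-minute streaming response that sends a chunk every 5 seconds.
    print(exceeds_load_balancer_timeout("aws", total_s=900, longest_silence_s=5))
    print(exceeds_load_balancer_timeout("gcp", total_s=900, longest_silence_s=5))
```

Under this model, a long streaming response stays within the AWS idle timeout but still hits the GCP deadline, while a silent 90-second request does the opposite.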

How can I query my Service? What about querying a specific version when doing A/B testing?

With Services, users can navigate to the Anyscale Console UI and click on their Service. A drop down will provide a CLI and python script to query the Service endpoint. If doing a rollout, users may choose to query the primary or canary version.
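The script shown in the Console is the authoritative way to query your Service. For orientation, the following is a generic sketch of what such a query might look like using only the Python standard library, assuming the Service is protected by a bearer token and accepts a JSON POST body. The URL, token, and payload are placeholders rather than real values.

```python
import json
import urllib.request


def build_query(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; copy the real endpoint and token from the Console.
    request = build_query("https://example.invalid/", "YOUR_TOKEN", {"text": "hello"})
    print(request.get_method(), request.full_url)
    # To send the request:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(response.read().decode("utf-8"))
```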

Can I configure the URL slug and endpoint of an Anyscale Service?

Users can provide a name when deploying an Anyscale Production Service that is integrated into the host header.

The format for the host header:

<service_name>-<5 digit random hex>.<cloud_id>
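As a rough illustration of this format, the sketch below splits a host string into its three parts. It assumes the random suffix is exactly five lowercase hexadecimal characters and that everything after the first dot belongs to the cloud ID portion; the function name and example host are hypothetical.

```python
import re

# <service_name>-<5 digit random hex>.<cloud_id>
_HOST_PATTERN = re.compile(
    r"(?P<service_name>[A-Za-z0-9_-]+)-(?P<suffix>[0-9a-f]{5})\.(?P<cloud_id>.+)"
)


def parse_service_host(host: str) -> dict:
    """Split a host into service name, random suffix, and cloud ID."""
    match = _HOST_PATTERN.fullmatch(host)
    if match is None:
        raise ValueError(f"host does not match the expected format: {host!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_service_host("my-service-a1b2c.example_cloud_id"))
```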

Can I remove the bearer token from my Anyscale Service?

Users have the option to toggle the bearer token for individual services by providing an access configuration in the Service YAML. By default, services are deployed with a bearer token, but it can be disabled by appending the following configuration to the Service YAML.

use_bearer_token: False

The bearer token can be toggled for a running Service by rolling out a new Service Version. It can take up to 5 minutes for the changes to be propagated, so exercise caution when toggling the bearer token for Services in production.

Where can I learn more about the architecture of Services?

See Services Architecture to learn more.