What are Anyscale services?

Anyscale services deploy Ray Serve applications to production endpoints. Anyscale services offer additional benefits on top of Ray Serve, including high availability and zero-downtime upgrades.

Anyscale service features

The following table provides an overview of Anyscale service features that extend the basic functionality of Ray Serve. See What is Ray Serve?

| Feature | Description |
| --- | --- |
| Fast autoscaling and model loading | RayTurbo Serve's fast model loading and startup-time optimizations improve autoscaling and cluster startup. In certain experiments, the end-to-end scaling time for Llama-3-70B is 5.1x faster on Anyscale than on open source Ray. |
| High QPS serving | RayTurbo provides an optimized version of Ray Serve that achieves up to 54% higher QPS and up to 3x streaming tokens per second for high-traffic serving use cases. |
| Replica compaction | RayTurbo Serve migrates replicas onto fewer nodes where possible to reduce resource fragmentation and improve hardware utilization. Replica compaction is enabled by default. Learn more in this blog. |
| Zero-downtime incremental rollouts | RayTurbo allows you to perform incremental rollouts and canary upgrades for robust production service management. Unlike KubeRay and open source Ray Serve, RayTurbo performs upgrades with rollback procedures without requiring 2x the hardware capacity. |
| Observability | RayTurbo provides custom metric dashboards, log search, tracing, and alerting for comprehensive observability into your production services. It can also export logs, metrics, and traces to observability tooling such as Datadog. |
| Multi availability zone services | RayTurbo enables availability-zone-aware scheduling of Ray Serve replicas for higher redundancy against availability zone failures. |
| Containerized runtime environments | RayTurbo configures different container images for different Ray Serve deployments, letting you prepare the dependencies each model needs separately. It includes all the container optimizations from fast autoscaling as well as an improved security posture over open source Ray Serve, because it doesn't require installing Podman or running with root permissions. |
| FastAPI integration | You can optionally use FastAPI to control HTTP handling logic, as shown in the sketch after this table. When you build your application with FastAPI, the Anyscale console includes a link to FastAPI documentation that lets you run sample queries against your defined routes. See Ray docs on FastAPI HTTP Deployments. |
| Support for spot instances with preemption | Anyscale services support using spot instances, on-demand instances, or spot instances with fallback to on-demand. With fallback to on-demand, Anyscale reacts to the cloud provider's two-minute spot preemption warning by attempting to spin up on-demand instances and migrate replicas to them, resulting in little to no downtime while benefiting from the cost savings of spot instances. |
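
For example, a minimal FastAPI ingress for a Ray Serve application might look like the following sketch. The deployment name, route, and response payload are illustrative, not part of Anyscale's API:

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)  # Route HTTP requests for `app` through this deployment.
class HelloIngress:
    @app.get("/hello")
    def hello(self, name: str = "world") -> dict:
        # FastAPI handles routing, validation, and docs generation;
        # Ray Serve handles scaling and replica placement.
        return {"greeting": f"Hello, {name}!"}


# Bind the deployment into an application that `serve.run` or a
# service config can reference.
my_app = HelloIngress.bind()
```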

What is Ray Serve?

Ray Serve is a scalable model serving library for building online inference applications. Ray Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. The flexibility of Ray Serve allows you to bring any model optimization such as TensorRT, vLLM, or DeepSpeed. Ray Serve scales across heterogeneous machines and provides flexible scheduling support to maximize hardware utilization.

Ray Serve is particularly well suited for model composition and many-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic, all in Python code. Some of the most valuable features Ray Serve offers include:

| Feature | Description |
| --- | --- |
| Model Composition | Compose many individual models and the business logic that AI applications often require, and scale each component independently. See the first sketch after this table. |
| Model Multiplexing | Use cloud resources efficiently by multiplexing models inside a pool of deployment replicas. This feature is useful when you have many models with a similar shape but different weights that you invoke sparsely. See the second sketch after this table. |
| Multiple Applications | Configure separate applications with their own deployment replicas that you can deploy or upgrade independently. Multi-app support is particularly powerful for use cases with many independent models deployed within one cluster to maximize hardware utilization. You can add, delete, or update models in one application without affecting other applications. |
| Autoscaling | Dynamically scale the resources for a model up and down by adjusting the number of replicas. See the third sketch after this table. |
| Resource Allocation | A flexible resource allocation model, including fractional GPUs, lets you serve models on limited hardware resources. |
| Dynamic Request Batching | Improve throughput without sacrificing latency goals. Also covered in the third sketch after this table. |
| Ray Serve LLMs | To accommodate the needs of more complex AI models, Ray Serve offers features like streaming responses, making it well suited to deploying generative AI and LLM applications. |
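
As a first illustration, here's a minimal model composition sketch using open source Ray Serve APIs. The `Preprocessor`, `SentimentModel`, and `Pipeline` deployments and their logic are hypothetical stand-ins for real models:

```python
from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Preprocessor:
    def transform(self, text: str) -> str:
        return text.strip().lower()


@serve.deployment
class SentimentModel:
    def predict(self, text: str) -> str:
        # Stand-in for real model inference.
        return "positive" if "good" in text else "negative"


@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request) -> str:
        text = (await request.body()).decode()
        cleaned = await self.preprocessor.transform.remote(text)
        return await self.model.predict.remote(cleaned)


# Each deployment scales independently; Pipeline holds handles to the others.
app = Pipeline.bind(Preprocessor.bind(), SentimentModel.bind())
```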
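
Second, model multiplexing builds on Ray Serve's `serve.multiplexed` API. In the sketch below, `TinyModel` and its loading logic are hypothetical stand-ins for fetching real weights from storage:

```python
from ray import serve


class TinyModel:
    """Stand-in for a real model; assumes all models share one architecture."""

    def __init__(self, model_id: str):
        self.model_id = model_id

    def predict(self, payload) -> dict:
        return {"model": self.model_id, "result": str(payload)}


@serve.deployment
class ModelPool:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str) -> TinyModel:
        # In practice, fetch the weights for `model_id` from storage here.
        # Ray Serve caches up to 3 models per replica and evicts the
        # least recently used one.
        return TinyModel(model_id)

    async def __call__(self, request) -> dict:
        # Ray Serve reads the model ID from the request's
        # `serve_multiplexed_model_id` header and routes the request to a
        # replica that already has that model loaded when possible.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.predict(await request.json())


app = ModelPool.bind()
```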
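
Third, autoscaling, fractional GPUs, and dynamic request batching are all configured on the deployment itself. A minimal sketch follows; the replica counts, batch sizes, and GPU fraction are illustrative numbers, and the autoscaling field names follow recent open source Ray Serve releases:

```python
from typing import List

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 0.5},  # Fractional GPU per replica.
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Scale out when replicas exceed this many concurrent requests.
        "target_ongoing_requests": 5,
    },
)
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, inputs: List[str]) -> List[int]:
        # Callers submit single items; Ray Serve groups them into a batch
        # and processes the whole batch in one call to improve throughput.
        return [len(x) for x in inputs]

    async def __call__(self, request) -> int:
        return await self.handle_batch((await request.body()).decode())


app = BatchedModel.bind()
```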

Find more information in the Ray Serve documentation.

Permission requirements

You can optionally deploy an Anyscale cloud without the permissions required to deploy services, and head node fault tolerance is opt-in for all cloud deployment options.

Cloud infrastructure to support Anyscale services uses the networking you configure while deploying your Anyscale cloud. Deploy a cloud with private networking if you need your load balancer to be private.

Services require additional IAM permissions in your cloud provider account to configure a Redis in-memory store and load balancer. See IAM permissions for AWS or Google Cloud.

The following table provides an overview of support for services with different cloud deployment options:

| Cloud deployment | Deployment method | Details |
| --- | --- | --- |
| Serverless Anyscale cloud (also called Anyscale-hosted cloud) | Deployed by default | Enables services by default. No support for head node fault tolerance or private networking. |
| Anyscale cloud on AWS | `anyscale cloud setup` | Enables services by default. Opt in to head node fault tolerance using the `--enable-head-node-fault-tolerance` flag. |
| Anyscale cloud on AWS | `anyscale cloud register` | You must configure IAM roles and a MemoryDB instance when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for AWS. |
| Anyscale cloud on Google Cloud | `anyscale cloud setup` | Enables services by default. Opt in to head node fault tolerance using the `--enable-head-node-fault-tolerance` flag. |
| Anyscale cloud on Google Cloud | `anyscale cloud register` | You must configure service account roles and a Memorystore instance when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for Google Cloud. |
| Anyscale cloud on Kubernetes | `anyscale cloud register` | You must configure custom permissions and a Redis in-memory store when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for Kubernetes. |
Important

Anyscale clouds on AWS have changed default behavior for deploying Anyscale services.

Legacy Anyscale clouds on AWS use CloudFormation to configure Elastic Load Balancing for your service. Anyscale now directly configures Elastic Load Balancing for your services.

All new Anyscale clouds on AWS deployed with anyscale cloud setup use this configuration by default. You can run anyscale cloud update to upgrade your legacy AWS clouds deployed with anyscale cloud setup to the new behavior.

Anyscale has updated the Anyscale Terraform modules for AWS to provide the proper IAM permissions for the new default behavior. If you have a legacy Anyscale cloud deployed using anyscale cloud register, contact Anyscale support for assistance updating your cloud IAM permissions.

See Update your IAM role for services on Anyscale clouds on AWS.

Capacity limit

There's a quota of 20 running services per Anyscale cloud. A service can have many deployments and can scale to more than 2,000 nodes. If you need to increase your quota, contact Anyscale support.

Pricing

Services use standard Anyscale pricing based on the type of machines used. See the Anyscale pricing page.

In addition to Anyscale costs and virtual machine costs, Anyscale uses load balancer resources and a Redis-compatible in-memory store in your cloud provider account.

Use the following links to learn about pricing details for these services:

| Cloud | Pricing links |
| --- | --- |
| AWS | Elastic Load Balancing pricing; Amazon MemoryDB pricing |
| GCP | Cloud Load Balancing pricing; Memorystore pricing |