What are Anyscale services?
Anyscale services deploy Ray Serve applications to production endpoints. Anyscale services offer additional benefits on top of Ray Serve, including high availability and zero-downtime upgrades.
Best practices
- Distribute replicas across nodes and availability zones. See Configure replica scaling for Anyscale services.
- Use head node fault tolerance. See Configure head node fault tolerance.
- Avoid scheduling on the head node. See Control head node scheduling.
- Configure retries and timeouts to control latency and backpressure. See Manage timeouts and retries for Anyscale services.
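Several of these practices come down to deployment options in your Ray Serve config. As a sketch (the application name and `main:app` import path are placeholders), spreading replicas across nodes can be expressed with the `max_replicas_per_node` deployment option:

```yaml
# Sketch of a Ray Serve config file; names and import path are placeholders.
applications:
  - name: my_app
    import_path: main:app
    deployments:
      - name: Model
        num_replicas: 4
        max_replicas_per_node: 1  # spread replicas across nodes
```

Limiting replicas per node reduces the blast radius of a single node failure, at the cost of requiring more nodes at a given replica count.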
Anyscale service features
The following table provides an overview of features of Anyscale services. These features extend the basic functionality of Ray Serve. See What is Ray Serve?.
Feature | Description |
---|---|
Fast autoscaling and model loading | RayTurbo Serve's fast model loading and startup time optimizations improve autoscaling and cluster startup. In certain experiments, end-to-end scaling for Llama-3-70B is 5.1x faster on Anyscale than on open source Ray. |
High QPS serving | RayTurbo provides an optimized version of Ray Serve that achieves up to 54% higher QPS and up to 3x streaming tokens per second for high-traffic serving use cases. |
Replica compaction | RayTurbo Serve migrates replicas into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization. Replica compaction is enabled by default. Learn more in this blog. |
Zero-downtime incremental rollouts | RayTurbo allows you to perform incremental rollouts and canary upgrades for robust production service management. Unlike KubeRay and open source Ray Serve, RayTurbo performs upgrades with rollback procedures without requiring 2x the hardware capacity. |
Observability | RayTurbo provides custom metric dashboards, log search, tracing, and alerting, for comprehensive observability into your production services. It also has the ability to export logs, metrics, and traces to your observability tooling like Datadog, etc. |
Multi availability zone services | RayTurbo enables availability-zone aware scheduling of Ray Serve replicas to provide higher redundancy to availability zone failures. |
Containerized runtime environments | RayTurbo lets you configure different container images for different Ray Serve deployments, so you can manage each model's dependencies separately. It includes all the container optimizations from fast autoscaling, as well as an improved security posture over open source Ray Serve, since it doesn't require installing Podman or running with root permissions. |
FastAPI integration | You can optionally use FastAPI to control HTTP handling logic. When you build your application with FastAPI, the Anyscale console includes a link to FastAPI documentation that lets you run sample queries against your defined routes. See Ray docs on FastAPI HTTP Deployments. |
Support for spot instances with preemption | Anyscale services support using spot instances, on-demand instances, or spot instances with fallback to on-demand. When you use spot instances with fallback to on-demand, Anyscale reacts to the two-minute spot preemption warning from the cloud provider by attempting to spin up and migrate replicas to on-demand instances, resulting in little to no downtime while benefiting from the cost savings associated with spot instances. |
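These features apply to any service you deploy. A minimal service config might look like the following sketch (all field values here are placeholders; check the Anyscale reference docs for the exact schema):

```yaml
# Illustrative Anyscale service config; values are placeholders.
name: my-service
applications:
  - import_path: main:app
compute_config: my-compute-config  # assumes a pre-created compute config
```

You would then deploy it with a command of roughly the form `anyscale service deploy -f service.yaml`.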
What is Ray Serve?
Ray Serve is a scalable model serving library for building online inference applications. Ray Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. The flexibility of Ray Serve allows you to bring any model optimization, such as TensorRT, vLLM, or DeepSpeed. Ray Serve ensures effortless scaling across heterogeneous machines and provides flexible scheduling support to maximize hardware utilization.
Ray Serve is particularly well suited for model composition and many-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. Some of the most valuable features Ray Serve has to offer include:
Feature | Description |
---|---|
Model Composition | Compose many individual models and business logic that are often required for building AI applications. You can scale each of these components independently. |
Model Multiplexing | Enable efficient utilization of cloud resources by multiplexing models inside a pool of deployment replicas. This feature is useful in cases where you have many models with similar shape, but different weights, that you invoke sparsely. |
Multiple Applications | Configure separate applications with their own deployment replicas that you can deploy or upgrade separately. Multi-app support is particularly powerful for use cases with many independent models deployed within one cluster to maximize hardware utilization. You can easily add, delete, or update models in one application without affecting other applications. |
Autoscaling | Ray Serve supports dynamically scaling the resources for a model up and down by adjusting the number of replicas. |
Resource Allocation | Ray Serve supports a flexible resource allocation model, including fractional GPUs, that enables you to serve models on limited hardware resources. |
Dynamic Request Batching | Improve throughput without sacrificing latency goals. |
Ray Serve LLMs | Features like streaming responses accommodate the needs of more complex AI models, making Ray Serve well suited for deploying generative AI and LLM applications. |
Find more information in the Ray Serve documentation.
Permission requirements
You can optionally deploy an Anyscale cloud without the permissions needed to deploy services. For all cloud deployment options, you must opt in to head node fault tolerance.
Cloud infrastructure to support Anyscale services uses the networking you configure while deploying your Anyscale cloud. Deploy a cloud with private networking if you need your load balancer to be private.
Services require additional IAM permissions in your cloud provider account to configure a Redis in-memory store and load balancer. See IAM permissions for AWS or Google Cloud.
The following table provides an overview of support for services with different cloud deployment options:
Cloud deployment | Deployment method | Details |
---|---|---|
Serverless Anyscale cloud (also called Anyscale-hosted cloud) | Deployed by default | Enables services by default. No support for head node fault tolerance or private networking. |
Anyscale cloud on AWS | anyscale cloud setup | Enables services by default. Opt-in to head node fault tolerance using the --enable-head-node-fault-tolerance flag. |
Anyscale cloud on AWS | anyscale cloud register | You must configure IAM roles and a MemoryDB instance when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for AWS. |
Anyscale cloud on Google Cloud | anyscale cloud setup | Enables services by default. Opt-in to head node fault tolerance using the --enable-head-node-fault-tolerance flag. |
Anyscale cloud on Google Cloud | anyscale cloud register | You must configure service account roles and a Memorystore instance when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for Google Cloud. |
Anyscale cloud on Kubernetes | anyscale cloud register | You must configure custom permissions and a Redis in-memory store when deploying your cloud. Contact Anyscale support for assistance customizing the Anyscale Terraform modules for Kubernetes. |
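For the `anyscale cloud setup` rows above, head node fault tolerance is enabled with the flag named in the table. An illustrative invocation (shown without the provider and region flags a real setup command also needs):

```shell
anyscale cloud setup --enable-head-node-fault-tolerance
```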
Anyscale clouds on AWS have changed default behavior for deploying Anyscale services.
Legacy Anyscale clouds on AWS use CloudFormation to configure Elastic Load Balancing for your service. Anyscale now directly configures Elastic Load Balancing for your services.
All new Anyscale clouds on AWS deployed with anyscale cloud setup use this configuration by default. You can run anyscale cloud update to upgrade your legacy AWS clouds deployed with anyscale cloud setup to the new behavior.
Anyscale has updated the Anyscale Terraform modules for AWS to provide the proper IAM permissions for the new default behavior. If you have a legacy Anyscale cloud deployed using anyscale cloud register, contact Anyscale support for assistance updating your cloud IAM permissions. See Update your IAM role for services on Anyscale clouds on AWS.
Capacity limit
There's a quota of 20 running services per Anyscale cloud. A service can have many deployments and can scale to greater than 2000 nodes. If you need to increase your quota, contact Anyscale support.
Pricing
Services use standard Anyscale pricing based on the type of machines used. See the Anyscale pricing page.
In addition to Anyscale costs and virtual machine costs, Anyscale uses load balancer resources and a Redis-compatible in-memory store in your cloud provider account.
Use the following links to learn about pricing details for these services:
Cloud | Pricing links |
---|---|
AWS | |
GCP | |