Frequently Asked Questions
You can find information about Services and Ray Serve below. Check out the Ray docs for more Ray Serve information.
Ray Serve and Anyscale Services
What is Ray Serve?
Ray Serve is a scalable model serving library for building online inference applications. Ray Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. This flexibility lets you bring your own model optimizations, such as TensorRT, vLLM, or DeepSpeed. Ray Serve scales across heterogeneous machines and provides flexible scheduling support to maximize hardware utilization.
Ray Serve is particularly well suited for model composition and many-model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic, all in Python code. Some of the most valuable features Ray Serve has to offer include:
- Model Composition: Compose many individual models and business logic that are often required for building AI applications. Each of these components can be scaled independently.
- Model Multiplexing: Enable efficient utilization of cloud resources by multiplexing models inside a pool of deployment replicas. This feature is useful in cases where you have many models with similar shape, but different weights, that you invoke sparsely.
- Multiple Applications: Configure separate applications with their own deployment replicas that can be deployed or upgraded separately. Multi-app support is particularly powerful for use cases with many independent models deployed within one cluster to maximize hardware utilization. You can easily add, delete, or update models in one application without affecting other applications.
- Autoscaling: Ray Serve supports dynamically scaling the resources for a model up and down by adjusting the number of replicas.
- Resource Allocation: Ray Serve supports a flexible resource allocation model, including fractional GPUs, that enables you to serve models on limited hardware resources.
- Dynamic Request Batching: Improve throughput without sacrificing latency goals.
- LLM Features: To accommodate the needs of more complex AI models, Ray Serve has added features such as streaming responses, which make it well suited to deploying generative AI and LLM applications.
- Runtime Variables: When writing an application, there are often parameters that change from development to production. For example, the path to trained model weights may need to be updated when a new model is released. Runtime variables and arguments allow for such changes without changing code.
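To make the model multiplexing idea above more concrete, here is a minimal pure-Python sketch of a per-replica model cache that keeps a bounded number of models loaded and evicts the least recently used one. It is an illustration of the concept only, not Ray Serve's implementation or API. The class name `MultiplexedModelCache`, the `fake_loader` function, and the eviction policy shown here are assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class MultiplexedModelCache:
    """Illustrative per-replica model cache (hypothetical, not Ray Serve code)."""

    def __init__(self, load_model: Callable[[str], object], max_models: int) -> None:
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self._load_model = load_model
        self._max_models = max_models
        # Ordered from least recently used (front) to most recently used (back).
        self._models: "OrderedDict[str, object]" = OrderedDict()

    def get(self, model_id: str) -> object:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            # Cache full: evict the least recently used model.
            self._models.popitem(last=False)
        model = self._load_model(model_id)
        self._models[model_id] = model
        return model

    def loaded_ids(self) -> List[str]:
        return list(self._models)


# Hypothetical loader that records which model IDs were actually loaded.
loads: List[str] = []


def fake_loader(model_id: str) -> Dict[str, str]:
    loads.append(model_id)
    return {"weights": f"weights-for-{model_id}"}


cache = MultiplexedModelCache(fake_loader, max_models=2)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
```

In this sketch, sparsely invoked models share a small pool of slots, so only the models that are currently in demand occupy memory on a replica.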
Find more information in the Ray Serve Documentation.
What is an Anyscale Production Service?
Anyscale Production Services are made to support Ray Serve Application deployments. Anyscale Services offer additional benefits on top of Ray Serve, including high availability and zero-downtime upgrades.
Deploy an Anyscale Service
What permissions are required for Services?
Anyscale follows the principle of least privilege for permissions. See the scope of permissions here: IAM Permissions
What cloud resources are set up when I deploy Services?
A Redis in-memory store and load balancing resources are required to support Services. These resources use the Service name when deployed and are tagged in their metadata with the Service name for easy lookup from the Cloud Provider console.
Is the Load Balancer public or private?
Anyscale supports both public and private load balancers. The Load Balancer type depends on the networking settings for the cloud. That is, when you use a cloud with private networking, the Load Balancer is private.
How much do Anyscale Production Services cost?
Anyscale Production Services are charged based on hardware usage. Services also support scaling down to 0 to minimize costs. Besides the instance usage costs, there are costs associated with load balancer resources and MemoryDB (for head node fault tolerance).
You can find information on these costs below:
AWS:
GCP:
How many Anyscale Services can I deploy in a single cloud?
There is a quota of 20 running Services per cloud. Each Service may have many deployments and can scale to more than 2,000 nodes. If you need to increase your quota, please reach out to Anyscale support.
Are there naming restrictions for an Anyscale Service?
Anyscale service names come with certain limitations:
- Names are limited to using alphanumeric characters, underscores, and dashes.
- The maximum character count for a name is 57 characters.
- Anyscale service names must be unique within a project.
- Note that multiple services with the same name can coexist within an organization.
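The character and length rules above can be checked locally before deploying. The following is a minimal sketch, assuming "alphanumeric" means ASCII letters and digits. The function name `service_name_problems` is hypothetical and not part of any Anyscale tooling. Uniqueness within a project cannot be verified locally, so it is not covered here.

```python
import re
from typing import List

# Limit taken from the naming rules above.
MAX_NAME_LENGTH = 57
# Assumption: "alphanumeric" is interpreted as ASCII letters and digits.
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9_-]+")


def service_name_problems(name: str) -> List[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems: List[str] = []
    if not name:
        problems.append("name is empty")
    elif not _ALLOWED_CHARS.fullmatch(name):
        problems.append("name may only contain alphanumeric characters, underscores, and dashes")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name is longer than {MAX_NAME_LENGTH} characters")
    return problems
```

For example, `service_name_problems("my-service_v2")` returns an empty list, while a name containing spaces or more than 57 characters returns one message per violated rule.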
How do I configure dependencies?
You can configure dependencies for your Service in three different ways:
- The most common approach is to add your dependencies in a container image.
- Dependencies can also be defined in a custom Docker image using the bring-your-own-Docker environment.
- Finally, you have the option to define dependencies through a runtime environment.
How do I specify compute resources?
You can find more information on how to define your cluster's compute resources in the Compute Configs documentation.
How do I optimize performance?
Please refer to the Performance Tuning documentation for suggested techniques on optimizing your Ray application.
Can I run my Service on Spot instances?
Users may elect to run Anyscale Production Services on Spot, On-Demand, or Spot with fallback to On-Demand. Anyscale uses the 2-minute Spot preemption warning to attempt to spin up an On-Demand instance and migrate replicas to it, so that Services see limited to no downtime while using Spot.
Miscellaneous
How do I integrate FastAPI into my Service?
You can find more information in FastAPI HTTP Deployments in the Ray Serve docs. Once your production Service is deployed, you can access the automatic OpenAPI documentation from the UI.
Why am I experiencing timeouts?
There are a few avenues you should explore when investigating the probable root cause of the observed timeouts:
- Load Balancer idle connection timeouts (AWS): If your requests take more than 60 seconds (the default AWS ELB idle connection timeout) to handle, you might observe 503 HTTP response codes when calling an Anyscale Service. To increase this default timeout, reach out to Anyscale Support, who can adjust it to a higher value for your organization.
- Load Balancer timeouts (GCP): The request handling timeout for GCP is set to 600 seconds by default. Note that this timeout differs from an idle connection timeout: it sets a hard deadline for request handling and triggers irrespective of whether your application is sending a response back, whereas an idle timeout occurs only on connections that have not seen any response data sent back.
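The difference between the two timeout types can be sketched as a small check. This is an illustrative helper only: the function name `timeout_risk` and its parameters are hypothetical, and the two constants are the default values quoted above, which may have been changed for your organization.

```python
from typing import Optional

# Default values quoted in the text above; your deployment may differ.
AWS_IDLE_TIMEOUT_S = 60
GCP_REQUEST_TIMEOUT_S = 600


def timeout_risk(provider: str, total_seconds: float, longest_silent_gap_seconds: float) -> Optional[str]:
    """Return a warning if a request is likely to hit the load balancer timeout.

    total_seconds: total time the request takes to complete.
    longest_silent_gap_seconds: longest stretch during which no response data is sent.
    """
    provider = provider.lower()
    if provider == "aws":
        # Idle timeout: only the silent gap matters, not the total duration.
        if longest_silent_gap_seconds > AWS_IDLE_TIMEOUT_S:
            return "idle connection timeout likely (AWS)"
        return None
    if provider == "gcp":
        # Hard deadline: the total duration matters even if data is streaming.
        if total_seconds > GCP_REQUEST_TIMEOUT_S:
            return "request deadline likely exceeded (GCP)"
        return None
    raise ValueError(f"unknown provider: {provider!r}")
```

Under these assumptions, a 300-second request that streams data every 10 seconds passes the AWS check, while a 700-second request fails the GCP check no matter how often it sends data.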
How can I query my Service? What about querying a specific version when doing A/B testing?
To query a Service, navigate to the Anyscale Console UI and click on the Service. A dropdown provides a CLI command and a Python script to query the Service endpoint. During a rollout, you can choose to query either the primary or the canary version.
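As a rough idea of what such a query looks like, here is a minimal sketch using only the Python standard library. It assumes the Service accepts a JSON POST body and that the bearer token is sent in a standard `Authorization` header. The URL, token, payload, and the function name `build_service_request` are placeholders for illustration. The script generated by the Console remains the authoritative version for your Service.

```python
import json
import urllib.request
from typing import Any, Dict


def build_service_request(base_url: str, token: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request to a Service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url,
        data=body,
        headers={
            # Assumption: the token is passed as a standard bearer token header.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; copy the real URL and token from the Anyscale Console.
request = build_service_request(
    "https://my-service-a1b2c.cld_example123.s.anyscaleuserdata.com/",
    "PLACEHOLDER_TOKEN",
    {"text": "hello"},
)

# To actually send the request, uncomment the following lines:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the headers and body before pointing the script at a production endpoint.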
Can I configure the URL slug and endpoint of an Anyscale Service?
Users can provide a name when deploying an Anyscale Production Service that is integrated into the host header.
The format for the host header:
<service_name>-<5 digit random hex>.<cloud_id>.s.anyscaleuserdata.com.
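The host format above can be taken apart with a small parser, which may help when mapping hostnames back to Services in logs or scripts. This is a sketch under stated assumptions: the service name uses only alphanumeric characters, underscores, and dashes, the random suffix is exactly five hexadecimal characters, and the cloud ID contains no dots. The names `ServiceHost` and `parse_service_host` and the example hostname are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ServiceHost(NamedTuple):
    service_name: str
    suffix: str
    cloud_id: str


# Mirrors <service_name>-<5 digit random hex>.<cloud_id>.s.anyscaleuserdata.com
_HOST_PATTERN = re.compile(
    r"(?P<service_name>[A-Za-z0-9_-]+)"
    r"-(?P<suffix>[0-9a-fA-F]{5})"
    r"\.(?P<cloud_id>[^.]+)"
    r"\.s\.anyscaleuserdata\.com"
)


def parse_service_host(host: str) -> Optional[ServiceHost]:
    """Split a Service hostname into its parts, or return None if it does not match."""
    match = _HOST_PATTERN.fullmatch(host)
    if match is None:
        return None
    return ServiceHost(**match.groupdict())


# Made-up hostname used purely for illustration.
parsed = parse_service_host("my-service-a1b2c.cld_example123.s.anyscaleuserdata.com")
```

Because the service name itself may contain dashes, the pattern treats the last dash-separated group of five hex characters before the first dot as the random suffix.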
Can I remove the bearer token from my Anyscale Service?
Users have the option to toggle the bearer token for individual services by providing an access configuration in the Service YAML. By default, services are deployed with a bearer token, but it can be disabled using the query_auth_token_enabled configuration.
The bearer token can be toggled for a running Service by rolling out a new version. It can take up to 5 minutes for the changes to be propagated, so exercise caution when toggling the bearer token for Services in production.