Anyscale simplifies building, running, and managing Ray applications by automating the creation, scaling, and termination of cloud resources required to run Ray workloads.
To provide this functionality, the Anyscale platform is divided into the Anyscale control plane and customer data planes. The following diagram summarizes the Anyscale architecture:
Anyscale control plane
The Anyscale control plane serves as the orchestration and presentation layer behind all Anyscale functionality, including the console, API, and user management. The control plane is managed by Anyscale and runs within Anyscale-managed cloud environments.
The control plane is a multi-tenant application that is highly available and secure by design. Users interact with the control plane to initiate actions like creating or deleting Ray clusters, deploying production jobs and services, and managing user access.
Customer data plane
The customer data plane is the deployment zone for Ray clusters managed by the control plane. Each Anyscale customer has their own data plane, made up of Anyscale Clouds deployed within their own cloud environment. Anyscale customers can deploy multiple Anyscale Clouds as part of their data plane in either AWS or GCP. For more information on how to deploy clouds, see Anyscale Cloud on AWS and Anyscale Cloud on Google Cloud.
The dual plane architecture has the following benefits:
- Customer isolation: because each customer owns their own data plane, Anyscale reduces the risk of sharing or unintentionally exposing data between tenants.
- Data ingress and egress: the data plane allows customers to deploy compute resources wherever their data exists, without having to move data outside of their environments.
- Geographic distribution: customers can deploy Anyscale Clouds in any commercially available AWS or GCP region, reducing latency between users and clusters as well as between clusters and data.
- Network traffic to the data plane: because Anyscale customers own their data plane, they are able to interact with their Ray clusters without traversing the control plane infrastructure, improving performance, reliability, and privacy.
Anyscale interacts with customer data planes primarily through the APIs provided by the cloud service provider. Using either cross-account IAM roles with external trusts or Workload Identity Federation, Anyscale can manage the lifecycle of Ray clusters on behalf of customers without storing permanent credentials.
After Ray clusters are created, they call back to the control plane to finalize configuration, report health checks, and publish other system logs. All channels of communication are encrypted and secured by role. Communication from clients to Anyscale-managed clusters occurs over an encrypted channel (TLS 1.2+ or equivalent).
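As an illustration of the cross-account trust described above, the following is a minimal sketch of an AWS IAM trust policy that allows a principal in another account to assume a role only when it presents an agreed external ID. It assumes that "external trusts" refers to the standard `sts:ExternalId` condition. The account ID and external ID are placeholders, and this is not the policy that Anyscale itself generates.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets one external account assume a role.

    Both arguments are placeholders supplied by the caller; the values used
    by a real deployment come from the cloud setup process.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The external account that is allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when the caller presents the
                # agreed external ID, so no long-lived credentials are shared.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

A role with a trust policy of this shape yields short-lived session credentials each time it is assumed, which is what makes it possible to manage resources without storing permanent credentials.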
Data and control flows between user, data plane, and control plane
Workloads run on Anyscale Clusters, which are Ray clusters managed by Anyscale. Throughout the lifecycle of a workload, data and control signals flow between the user, the Anyscale control plane, and the customer data plane. The exact flows vary depending on how the workload is initialized and on whether it runs directly on an Anyscale Cluster, as an Anyscale Workspace, as an Anyscale Job, or as an Anyscale Service. The following represents a possible flow for a user-initiated Anyscale Cluster.
Creating Clusters for workloads
- You can log in to the Anyscale console using the native user management system. Alternatively, you can set up single sign-on (SSO) so that you can access the console through your SSO provider. Once you are signed in to the Anyscale console, you can retrieve a new Anyscale CLI token.
- Using the SDK or CLI from a laptop, you run a command to create a Ray cluster. In the request, you specify the following (among others):
  - The Anyscale Cloud where the cluster will be launched within the customer data plane.
  - A Cluster Compute Config that defines the types and quantities of virtual machines associated with the cluster. For more information about Ray clusters, see the Ray documentation.
  - A Cluster Environment that defines the dependencies and configuration of the Ray runtime container on the cluster.
- The request is sent to the Anyscale APIs hosted in the Anyscale control plane. Anyscale validates your CLI token and checks your permissions to authorize the request.
- The Anyscale cluster management system assumes a role using cross-account IAM roles or Workload Identity Federation and initiates the creation of resources in the specified account within the data plane. Once active, the Ray cluster calls back to the Anyscale control plane to register nodes, finalize configurations, and pull the image for the cluster environment.
- The Anyscale Cluster sends periodic health checks and system data back to the control plane. Similarly, Ray autoscaling requests are directed back to the control plane for Anyscale to manage. This design reduces the IAM permissions required for Anyscale Clusters themselves.
- Depending on the version of the Anyscale CLI used to create the cloud, application logs from the Ray workload are persisted within a storage bucket associated with the cloud.
- Requests to the cluster are authenticated using temporary access tokens that are vended by the control plane. The user can interact with the Anyscale Cluster in a number of ways, including:
  - Viewing the health of the cluster in the Anyscale console and querying its status via the API, CLI, or SDK.
  - Accessing the Ray dashboard, Jupyter notebooks, or the Grafana dashboard hosted on the head node of the cluster.
  - Submitting Jobs to the cluster by using the Anyscale-provided address that routes directly to the cluster's IP addresses (not through the control plane).
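The job submission step above can be sketched with the open-source Ray Jobs SDK (`ray.job_submission.JobSubmissionClient`). This is a minimal sketch rather than the Anyscale-specific workflow: the cluster address, the entrypoint script, and in particular the bearer-token header are assumptions for illustration, and the way an Anyscale-vended temporary token is actually attached to a request may differ.

```python
from typing import Optional


def build_job_request(entrypoint: str, working_dir: str) -> dict:
    """Assemble keyword arguments for JobSubmissionClient.submit_job."""
    return {
        "entrypoint": entrypoint,
        # Ray uploads this local directory to the cluster before running.
        "runtime_env": {"working_dir": working_dir},
    }


def submit(address: str, request: dict, token: Optional[str] = None) -> str:
    """Submit the job to the cluster at `address` and return its submission ID."""
    # Imported here so build_job_request can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    # ASSUMPTION: the token is sent as a bearer header. This scheme is
    # hypothetical and only shows where a temporary credential would go.
    headers = {"Authorization": f"Bearer {token}"} if token else None
    client = JobSubmissionClient(address, headers=headers)
    return client.submit_job(**request)


# Example usage against a hypothetical cluster address:
#   request = build_job_request("python script.py", "./")
#   submission_id = submit("http://127.0.0.1:8265", request)
```

Because the client talks to the address of the cluster itself, the request and any uploaded files go straight to the data plane, which matches the direct routing described in the last bullet.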
The user can terminate workloads at any time using the console, API, SDK, or CLI. Much like creation, the Anyscale control plane assumes a cross-account entity and uses the cloud provider's APIs to initiate and track the termination of resources associated with the cluster.
Architectural advantages of Anyscale Clusters
Any workload started by Anyscale runs on an Anyscale-managed Ray cluster. Anyscale-managed clusters have the following important advantages over open-source Ray clusters:
Users do not need to manage the cluster lifecycle. They can start a cluster using the UI, CLI, or SDK, and Anyscale automatically terminates it when it is idle.
Anyscale Clusters automatically scale to accommodate a workload. Anyscale provides a set of default Compute Configs for CPU and GPU workloads as a starting point. If the user wants to customize parameters (for example the instance types or maximum cluster size), they can do so by specifying their own Compute Config.
Users can create Anyscale Clouds that include an AWS or GCP identity, which will be used to run the data plane on their own cloud account. If they are using multiple Anyscale Clouds, they can change their preferred default and also override the default within their Compute Config.
Anyscale provides multiple ways to manage dependencies. Anyscale allows users to build Cluster Environments, creating Docker images that package all the dependencies ahead of runtime. This way, cluster startup can be fast and reliable. Anyscale also supports Runtime Environments which, among other things, enable users to upload code and data from their local machine onto the cluster, and make it possible to use different dependencies within different parts of their application.
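To make the Runtime Environment idea concrete, here is a minimal sketch using the open-source Ray `runtime_env` option. One environment is applied to the whole application and a different one to a single task. The package names are arbitrary examples, and the sketch assumes a cluster (or local Ray installation) that is allowed to install packages at runtime.

```python
# Environment applied to the whole application (values are illustrative).
JOB_RUNTIME_ENV = {
    "working_dir": ".",   # upload the current local directory to the cluster
    "pip": ["requests"],  # packages installed for tasks by default
}

# Environment for a single task. Per Ray's inheritance rules, a field set at
# the task level overrides the same field from the job-level environment.
TASK_RUNTIME_ENV = {"pip": ["emoji"]}


def run_example() -> str:
    """Start Ray with the job-level environment and run one task with its own."""
    # Imported lazily so the dictionaries above can be inspected without Ray.
    import ray

    ray.init(runtime_env=JOB_RUNTIME_ENV)

    @ray.remote(runtime_env=TASK_RUNTIME_ENV)
    def emoji_version() -> str:
        # Runs inside the task-specific environment, where "emoji" is installed.
        from importlib.metadata import version

        return version("emoji")

    return ray.get(emoji_version.remote())
```

Dependencies that rarely change are generally a better fit for a Cluster Environment image, since they are then installed before the cluster starts rather than at runtime.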
Anyscale provides convenient ways to scale up Ray applications during both the development and the production stages.
With Anyscale, machine learning practitioners can use Anyscale Workspaces to program the cluster while working with familiar tools like JupyterLab notebooks or Visual Studio Code. A Workspace is a fully managed development environment focused on developer productivity.
In production, it is preferable to use the cloud to run not just compute-intensive operations but also the application itself. Additionally, the application should be fault-tolerant to avoid incidents and guarantee availability.
Anyscale supports two types of production workloads:
- If the application has a bounded amount of computation to do, for example, an RLlib application, use an Anyscale Job. When you start the job, Anyscale provisions a Ray cluster, deploys the application, and runs it to completion, restarting it in the event of failure.
- If the application needs to run indefinitely, for example, a Ray Serve application, use an Anyscale Service. When you start the service, Anyscale provisions a Ray cluster that provides high availability and rolling upgrades without downtime, among other features.
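The "run to completion, restarting on failure" behavior of a job can be illustrated with a small local sketch. It is a simplified stand-in that reruns a command in a subprocess, not how Anyscale implements job retries. The function name and the `max_retries` parameter are hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def run_to_completion(command: Sequence[str], max_retries: int = 3) -> int:
    """Run `command` until it exits with status 0 or retries are exhausted.

    Returns the number of attempts made. Raises RuntimeError if the first
    attempt and all `max_retries` retries fail.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempts
    raise RuntimeError(f"command failed after {attempts} attempts")


if __name__ == "__main__":
    # A trivial entrypoint that succeeds on the first attempt.
    print(run_to_completion([sys.executable, "-c", "print('done')"]))
```

A service differs in that there is no completion condition: the application is kept running and replaced in place during upgrades rather than rerun from the start.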
Production APIs for both jobs and services are declarative. They allow users to specify a working directory as part of the Runtime Environment, which can contain both code and data that the application will use, along with an entrypoint to start the application. Anyscale Jobs and Services leverage Anyscale Clusters, which means that they also support the usage of Compute Configs and Cluster Environments.
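A declarative specification of this kind can be pictured as plain data that names what should run and leaves the how to the platform. The sketch below is a hypothetical model: the class and its field names (`compute_config`, `cluster_env`, and so on) are chosen for illustration and are not taken from the Anyscale API reference.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Hypothetical declarative description of a job or service."""

    name: str
    entrypoint: str      # command that starts the application
    working_dir: str     # code and data shipped via the Runtime Environment
    compute_config: str  # name of a Compute Config
    cluster_env: str     # name of a Cluster Environment

    def to_dict(self) -> dict:
        """Render the spec in a nested form suitable for YAML or JSON."""
        return {
            "name": self.name,
            "entrypoint": self.entrypoint,
            "runtime_env": {"working_dir": self.working_dir},
            "compute_config": self.compute_config,
            "cluster_env": self.cluster_env,
        }


if __name__ == "__main__":
    # All values are placeholders.
    spec = WorkloadSpec(
        name="nightly-training",
        entrypoint="python train.py",
        working_dir=".",
        compute_config="example-compute-config",
        cluster_env="example-cluster-env",
    )
    print(json.dumps(spec.to_dict(), indent=2))
```

Keeping the specification as immutable data means the same description can be version-controlled and resubmitted unchanged.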