Architecture

Anyscale simplifies building, running, and managing Ray apps by automating the creation, scaling, and termination of cloud resources required to run Ray workloads.

To provide these functionalities, Anyscale divides the platform into the Anyscale control plane and customer data planes. The following diagram summarizes the Anyscale architecture:

[Diagram: Running a script on Anyscale]

Anyscale control plane

The Anyscale control plane serves as the orchestration and presentation layer behind all Anyscale functionalities, including the console, API, and user management. Managed by Anyscale, the control plane operates within Anyscale-managed cloud environments.

This control plane is a multi-tenant application designed for high availability and security. Users interact with the control plane to initiate actions such as creating or deleting Ray clusters, deploying production jobs and services, and managing user access.

Customer data plane

The customer data plane is the deployment zone for Ray clusters that the control plane manages. Each Anyscale customer has their own data plane, made up of Anyscale Clouds deployed within their own cloud environment. Customers can deploy multiple Anyscale Clouds as part of their data plane in either AWS or GCP. For more information on how to deploy clouds, see Anyscale Cloud on AWS and Anyscale Cloud on Google Cloud.

The dual plane architecture has the following benefits:

  • Customer isolation: Because each customer owns their own data plane, Anyscale reduces the risk of sharing or unintentionally exposing data between tenants.
  • Data ingress and egress: The data plane allows customers to deploy compute resources wherever their data exists, without having to move data outside of their environments.
  • Geographic distribution: Customers can deploy Anyscale Clouds in any commercially available AWS or GCP region, reducing latency both between users and clusters and between clusters and data.
  • Network traffic to the data plane: Because Anyscale customers own their data plane, they can interact with their Ray clusters without traversing the control plane infrastructure, improving performance, reliability, and privacy.

Anyscale interacts with customer data planes primarily through the APIs provided by the cloud service provider. Using either cross-account IAM roles with external trusts or Workload Identity Federation, Anyscale can manage the lifecycle of Ray clusters on behalf of customers without storing permanent credentials.
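As a rough illustration of what a cross-account IAM role with an external trust looks like on AWS, the sketch below builds a trust policy that lets one external account assume a role only when it presents an agreed external ID. The account ID and external ID are placeholders, and the helper name `build_trust_policy` is hypothetical; the policy that Anyscale actually provisions may differ.

```python
import json


def build_trust_policy(control_plane_account_id: str, external_id: str) -> dict:
    """Return an AWS IAM trust policy for a cross-account role.

    The role can be assumed only by the given account, and only when the
    caller supplies the matching external ID.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Placeholder principal: the account allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{control_plane_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The external ID condition guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_trust_policy("111111111111", "example-external-id")
print(json.dumps(policy, indent=2))
```

Because the role is assumed on demand and yields short-lived credentials, no permanent access keys need to be stored by the party that assumes it.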

After Ray clusters are created, they call back to the control plane to finalize configuration, report health checks, and publish other system logs. All channels of communication are encrypted and secured by role. Communication from clients to Anyscale-managed clusters occurs over an encrypted channel (TLS 1.2+ or equivalent).

Data and control flows between user, data plane, and control plane

Workloads run on top of Anyscale-managed Ray clusters. Data and control signals flow between the user, the Anyscale control plane, and the customer data plane throughout the lifecycle of a workload. The exact flows may vary depending on how the workload is initialized as well as whether it is an Anyscale Workspace, an Anyscale Job, or an Anyscale Service. The following represents a possible flow in the context of a user-initiated Anyscale cluster.

Creating clusters for workloads

  1. You can log in to the Anyscale console using the native user management system. Alternatively, you can set up single sign-on (SSO) so that you can access the console through your SSO provider. Once you are signed in to the Anyscale console, you can retrieve a new Anyscale CLI token.
  2. Using the SDK or CLI from a laptop, you run a command to create a Ray cluster. In the request, you specify the following (among others):
    • The Anyscale Cloud where the cluster launches within the customer data plane.
    • A compute config that defines the types and quantities of virtual machines associated with the cluster. For more information about Ray clusters, see the Ray documentation.
    • A container image that defines the dependencies and configuration of the Ray runtime container on the cluster.
  3. The request is sent to the Anyscale APIs hosted in the Anyscale control plane. Anyscale authenticates your CLI token as a valid token, and checks your permissions to authorize the request.
  4. The Anyscale cluster management system assumes a role using cross-account IAM roles or Workload Identity Federation and initiates the creation of resources in the data plane in the specified account. Once active, the Ray cluster calls back to the Anyscale control plane to register nodes, finalize configurations, and pull the container image.
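The steps above can be sketched as a small, self-contained simulation. This is not the Anyscale SDK or API: `ClusterRequest`, `TOKEN_PERMISSIONS`, `handle_create_cluster`, and the `cluster:create` permission string are all hypothetical names chosen to show the order of operations (authenticate, authorize, assume a role, create resources, register).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical request body holding the three inputs named in step 2."""
    cloud: str
    compute_config: str
    container_image: str


# Hypothetical in-memory table mapping CLI tokens to granted permissions.
TOKEN_PERMISSIONS = {"token-abc": {"cluster:create"}}


def handle_create_cluster(token: str, request: ClusterRequest) -> List[str]:
    """Simulate the control plane handling a create-cluster request."""
    steps = []

    # Step 3: authenticate the token, then authorize the action.
    permissions = TOKEN_PERMISSIONS.get(token)
    if permissions is None:
        raise PermissionError("unknown CLI token")
    steps.append("authenticated token")
    if "cluster:create" not in permissions:
        raise PermissionError("token is not allowed to create clusters")
    steps.append("authorized request")

    # Step 4: assume a cross-account identity and create data plane resources.
    steps.append(f"assumed role for cloud {request.cloud}")
    steps.append(f"created resources from compute config {request.compute_config}")
    steps.append(f"cluster registered and pulled image {request.container_image}")
    return steps


request = ClusterRequest("my-cloud", "default-cpu", "my-image:1")
for step in handle_create_cluster("token-abc", request):
    print(step)
```

Authentication and authorization are kept as separate checks so that a valid token without the right permission fails with a different message than an unknown token.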

Running workloads

  1. The Anyscale cluster sends periodic health checks and system data back to the control plane. Similarly, Ray autoscaling requests are directed back to the control plane for Anyscale to manage. This design reduces the IAM permissions required for Anyscale clusters themselves.
  2. Depending on the version of the Anyscale CLI used to create the cloud, app logs from the Ray workload persist within a storage bucket associated with the cloud.
  3. Requests to the cluster are authenticated using temporary access tokens that are vended by the control plane. The user can interact with the Anyscale cluster in a number of ways, including:
    • Viewing the health of the cluster in the Anyscale console and querying status through the API, CLI, or SDK.
    • Accessing the Ray Dashboard, Jupyter notebooks, or the Grafana dashboard hosted on the head node of the cluster.
    • Submitting jobs to the cluster by using the Anyscale-provided address that routes directly to the cluster's IP addresses (not through the control plane).
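To make the idea of temporary access tokens concrete, here is a minimal sketch of an expiring, signed token. The source doesn't describe Anyscale's token format, so this uses an HMAC-signed string purely as a stand-in; `SECRET`, `vend_token`, and `verify_token` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical signing key held by the token issuer (illustration only).
SECRET = b"demo-signing-key"


def vend_token(cluster_id: str, ttl_seconds: int, now: float) -> str:
    """Issue a token for one cluster that expires ttl_seconds after now."""
    expires_at = int(now) + ttl_seconds
    payload = f"{cluster_id}:{expires_at}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def verify_token(token: str, cluster_id: str, now: float) -> bool:
    """Accept the token only if it is untampered, unexpired, and for this cluster."""
    try:
        token_cluster, expires_at, signature = token.rsplit(":", 2)
    except ValueError:
        return False  # Not in the expected three-part format.
    payload = f"{token_cluster}:{expires_at}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        return False
    return token_cluster == cluster_id and now < int(expires_at)


token = vend_token("cluster-1", ttl_seconds=60, now=1000.0)
print(verify_token(token, "cluster-1", now=1030.0))
print(verify_token(token, "cluster-1", now=1061.0))
```

Passing `now` explicitly instead of reading the clock inside the functions keeps the sketch deterministic and easy to test.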

Workload termination

The user can terminate workloads at any time using the console, API, SDK, or CLI. Much like creation, the Anyscale control plane assumes a cross-account identity and uses the cloud provider's APIs to initiate and track the termination of resources associated with the cluster.

Architectural advantages of Anyscale clusters

Any workload started by Anyscale runs on an Anyscale-managed Ray cluster. Anyscale-managed clusters have several important advantages over open source Ray clusters:

  • Users don't need to manage the cluster lifecycle. They can start a cluster using the UI, CLI, or SDK, and Anyscale automatically terminates it when it idles.

  • Anyscale clusters automatically scale to accommodate a workload. Anyscale provides a set of default compute configs for CPU and GPU workloads as a starting point. If the user wants to customize parameters (for example the instance types or maximum cluster size), they can do so by specifying their own compute config.

  • Users can create Anyscale Clouds that include an AWS or GCP identity, which runs the data plane on their own cloud account. If they're using multiple Anyscale Clouds, they can change their preferred default and also override the default within their compute config.

  • Anyscale provides multiple ways to manage dependencies. For example, users can build container images that package all dependencies ahead of runtime, which makes cluster startup fast and reliable.
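The way a compute config bounds autoscaling can be sketched as follows. The `ComputeConfig` fields and the `target_workers` sizing rule are assumptions made for illustration. They are not the real Anyscale compute config schema or the Ray autoscaler's algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeConfig:
    """Hypothetical compute config: an instance type plus worker count bounds."""
    instance_type: str
    min_workers: int
    max_workers: int


def target_workers(config: ComputeConfig, pending_tasks: int, tasks_per_worker: int) -> int:
    """Return how many workers to run, clamped to the config's bounds."""
    # Ceiling division: workers needed to cover all pending tasks.
    needed = -(-pending_tasks // tasks_per_worker)
    return max(config.min_workers, min(config.max_workers, needed))


config = ComputeConfig(instance_type="m5.2xlarge", min_workers=1, max_workers=10)
print(target_workers(config, pending_tasks=35, tasks_per_worker=4))   # 9
print(target_workers(config, pending_tasks=100, tasks_per_worker=4))  # 10
print(target_workers(config, pending_tasks=0, tasks_per_worker=4))    # 1
```

Raising `max_workers` or changing `instance_type` in such a config is the kind of customization the list above refers to.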

Apps

Anyscale provides convenient ways to scale up Ray apps during both the development and the production stages.

Development

With Anyscale, machine learning practitioners can use Anyscale Workspaces to program the cluster while working with familiar tools like JupyterLab notebooks or Visual Studio Code. Workspaces are a fully managed development environment focused on developer productivity.

Production

In production, it's preferable to use the cloud to run not just compute-intensive operations but also the app itself. Additionally, the application should be fault-tolerant to avoid incidents and guarantee availability.

Anyscale supports production workloads, which fall under two types:

  • If the app has a bounded amount of computation to do, for example, an RLlib app, use an Anyscale Job. When you start the job, Anyscale provisions a Ray cluster, deploys the app, and runs it to completion, restarting it in the event of failure.
  • If the app needs to run indefinitely, for example, a Ray Serve app, use an Anyscale Service. When you start the service, Anyscale provisions a Ray cluster that provides features such as high availability and rolling upgrades without downtime.
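The "run to completion, restarting in the event of failure" behavior of a job can be illustrated with a short retry loop. This is a conceptual sketch only: `run_to_completion`, `make_flaky_app`, and the `max_restarts` limit are hypothetical and do not reflect how Anyscale Jobs are implemented or configured.

```python
from typing import Callable


def run_to_completion(app: Callable[[], str], max_restarts: int) -> str:
    """Run app until it returns, restarting it after each failure.

    Gives up and re-raises once max_restarts restarts have been used.
    """
    restarts = 0
    while True:
        try:
            return app()
        except Exception as exc:
            if restarts >= max_restarts:
                raise RuntimeError(f"gave up after {restarts} restarts") from exc
            restarts += 1


def make_flaky_app(failures_before_success: int) -> Callable[[], str]:
    """Build a demo app that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def app() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated failure")
        return f"finished after {state['calls']} attempts"

    return app


print(run_to_completion(make_flaky_app(2), max_restarts=3))
```

A service differs in that it has no completion point: rather than retrying until a result is returned, it is kept running and replaced gradually during upgrades.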

Anyscale Jobs and Services leverage Anyscale clusters, which means that they also support the usage of compute configs and container images.