
Anyscale operator for Kubernetes

The Anyscale operator for Kubernetes lets you deploy the Anyscale platform on Kubernetes clusters running on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Kubernetes Engine (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters in the cloud or on-premises. See the diagram below for a high-level overview of the Anyscale operator:

The Anyscale operator operates on the following resources.

Namespaced resources

  • Pods: each Anyscale / Ray node maps to a single pod.
  • Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale Services (user laptop -> Anyscale Service). Ingresses may be either private or public.
  • Secrets: used to hold secrets used by the Anyscale operator.
  • ConfigMaps: used to store configuration options for the Anyscale operator.
  • Events: used to enhance workload observability.

Global resources

  • TokenReview: When an Anyscale node starts up in an Anyscale workload, Anyscale uses the Kubernetes TokenReview API to verify the pod's identity as the pod bootstraps itself to the Anyscale control plane.
  • Nodes: The operator periodically reads node information to enhance workload observability.

Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the permissions it needs to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside a vCluster in a namespace of your choice.
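
One way to check for these permissions before installing the chart is `kubectl auth can-i`. The following is a minimal sketch rather than part of any Anyscale tooling; the function name `can_create_cluster_rbac` is hypothetical.

```shell
# Sketch: report whether the current kubectl identity can create the
# cluster-scoped RBAC objects that the Helm chart installs.
# `can_create_cluster_rbac` is a hypothetical helper name.
can_create_cluster_rbac() {
  missing=0
  for resource in clusterroles clusterrolebindings; do
    # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
    answer=$(kubectl auth can-i create "$resource" 2>/dev/null) || true
    if [ "$answer" = "yes" ]; then
      echo "ok: can create $resource"
    else
      echo "missing: cannot create $resource"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `can_create_cluster_rbac` against the target cluster returns a non-zero status if either permission is missing, in which case the vCluster route may be the better fit.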

Deployment modes

Cloud-native mode (only supported for AWS and GCP)

Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.

Cloud-agnostic mode (supported for any Kubernetes cluster)

Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.

tip

If running on EKS or GKE, use cloud-native mode when possible.

Prerequisites

  • A Kubernetes cluster.
    • Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
  • Permissions to deploy a Helm chart into the Kubernetes cluster.
  • The name of the Kubernetes namespace into which you want to deploy the Anyscale operator.
  • An ingress controller. Use the Ingress-NGINX controller when possible; other ingress controllers may work, but aren't fully tested. When using the Ingress-NGINX controller, set the allow-snippet-annotations option to true in the NGINX ConfigMap. Anyscale services rely on this setting.
    • For direct networking, configure an internet-facing load balancer.
    • For customer-defined networking, configure an internal load balancer.
      • In some cases, an annotation on the LoadBalancer service in front of the NGINX pods can be applied to configure internal load balancing.
    • For reference, see this link for the difference between the direct and customer-defined networking modes on the AWS VM stack, including the pros and cons of each approach.
  • Egress to the internet from Anyscale pods deployed into the Kubernetes cluster. This is a requirement of all Anyscale deployments.
  • If using GPUs, the appropriate NVIDIA drivers and device plugins (references: EKS, GKE, AKS).
  • An S3 bucket for system and artifact storage.
    • The Anyscale operator and all Pods created by the operator must have direct access to this storage bucket.
    • See object storage bucket permissions for additional details.
  • An IAM role for the Anyscale operator to use, for the purposes of verifying the operator identity.
  • (Optional, highly recommended) An EFS mount target with subnets and a security group that allow communication from the EKS cluster.
    • Anyscale uses Amazon EFS for Anyscale Workspaces persistence, as well as cluster shared storage.
    • If an EFS mount target is not provided, Workspaces persistence and cluster shared storage will be disabled.
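
Two of the prerequisites above, the Kubernetes version and the Ingress-NGINX snippet setting, can be checked from a shell. The sketch below is illustrative only: the function names are hypothetical, and the default ConfigMap name and namespace assume an Ingress-NGINX Helm release named `ingress-nginx` installed in the `ingress-nginx` namespace, which may not match your cluster.

```shell
# Sketch: preflight checks for two of the prerequisites above.
# Function names and the default ConfigMap location are assumptions.

# Succeeds if a version string such as "v1.29.3-gke.100" is at least v1.28.
k8s_version_ok() {
  version=${1#v}                 # drop the leading "v"
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  minor=${minor%%[!0-9]*}        # drop suffixes such as "+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 28 ]; }
}

# Succeeds if allow-snippet-annotations is "true" in the controller ConfigMap.
# Pass your own namespace and ConfigMap name if the defaults don't apply.
nginx_snippets_enabled() {
  namespace=${1:-ingress-nginx}
  configmap=${2:-ingress-nginx-controller}
  value=$(kubectl get configmap "$configmap" --namespace "$namespace" \
    -o jsonpath='{.data.allow-snippet-annotations}') || return 1
  [ "$value" = "true" ]
}
```

Pass the server's version string (as reported by `kubectl version`) to `k8s_version_ok`. A failure doesn't necessarily mean the operator won't work, since earlier versions may work but aren't fully tested.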

See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.

Permissions

The Anyscale operator requires the following permissions to run Ray workloads on Kubernetes.

Kubernetes Permissions

The Anyscale operator must be run with a Kubernetes Service Account that has permissions to operate on a handful of core Kubernetes resources. For details on these permissions, see the Role and ClusterRole in the Anyscale operator Helm Chart.

Object Storage Bucket Permissions

The Anyscale operator (and Pods created by the Anyscale operator) must have access to the object storage bucket that is used for system and artifact storage. The Anyscale operator must additionally have the ability to generate presigned URLs for reading and writing artifacts to the object storage bucket.

Access to this storage bucket may be granted in a variety of ways, depending on the environment in which the Kubernetes cluster is running and the provider of the storage bucket.
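
However access is granted, it can help to confirm it end to end. The sketch below assumes an S3 bucket, and that the AWS CLI and curl are available in an environment running under the same identity the operator uses. The function name `check_bucket_access` and the object key are hypothetical. It exercises a write, a presigned read, and a delete; it doesn't cover presigned writes.

```shell
# Sketch: round-trip check of S3 access, run with the operator's credentials.
# `check_bucket_access` and the object key are hypothetical names.
check_bucket_access() {
  bucket=$1
  key="anyscale-access-check-$$.txt"
  # 1. Write a small object.
  echo "ok" | aws s3 cp - "s3://$bucket/$key" >/dev/null \
    || { echo "write failed" >&2; return 1; }
  # 2. Generate a presigned GET URL valid for 5 minutes.
  url=$(aws s3 presign "s3://$bucket/$key" --expires-in 300) \
    || { echo "presign failed" >&2; return 1; }
  # 3. Read the object back through the presigned URL.
  curl -fsS "$url" >/dev/null \
    || { echo "presigned read failed" >&2; return 1; }
  # 4. Clean up the test object.
  aws s3 rm "s3://$bucket/$key" >/dev/null
  echo "bucket access ok"
}
```

Note that `aws s3 presign` signs the URL locally, so step 3 is what actually confirms that the presigned URL works. If a step fails, the test object may be left behind in the bucket.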

We recommend following these references to grant Anyscale workloads access to an AWS S3 bucket:

Deployment

Add the Anyscale Helm chart repository.

helm repo add anyscale https://anyscale.github.io/helm-charts
helm repo update anyscale

tip

Before registering and deploying the Anyscale operator, review the ways to customize the Helm chart to modify the deployment.
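
To see which options the chart exposes, one approach is to dump its default values with `helm show values`. This is a small sketch; the function name and the default file name are illustrative, and the available options depend on the chart version you pull.

```shell
# Sketch: save the chart's default values so they can be reviewed and edited.
# The function name and default file name are illustrative.
save_operator_values() {
  output=${1:-anyscale-operator-values.yaml}
  helm show values anyscale/anyscale-operator > "$output" || return 1
  echo "wrote $output"
}
```

After editing the file, it can be passed to the `helm upgrade` command with `--values <file>` alongside the `--set-string` flags.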

Then, sign in to your Anyscale account using anyscale login, and proceed with the following steps:

anyscale cloud register --name <cloud-name> \
--provider aws \
--region <region> \
--compute-stack k8s \
--kubernetes-zones <comma-separated-zones> \
--anyscale-operator-iam-identity <anyscale-operator-iam-role-arn> \
--cloud-storage-bucket-name s3://<cloud-storage-bucket-name> \
--file-storage-id <efs-id>

helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=aws \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i

tip

The helm upgrade command requires a cloud deployment ID, which is emitted when you register the cloud. If you forget your cloud deployment ID, you can retrieve it using anyscale cloud config get --name <cloud-name>.

At this point, the Anyscale operator should come up and start posting health checks to the Anyscale Control Plane. You should be ready to run workloads as you normally would on Anyscale clouds.
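
To check from the cluster side that the operator came up, the following sketch confirms that the Helm release exists and that the pods in the namespace are Ready. The function name `verify_operator` is hypothetical, and the check assumes the namespace holds only operator pods at this point. It doesn't confirm that health checks reach the Anyscale Control Plane.

```shell
# Sketch: confirm the Helm release deployed and its pods are Ready.
# `verify_operator` is a hypothetical helper name.
verify_operator() {
  release=$1
  namespace=$2
  helm status "$release" --namespace "$namespace" >/dev/null \
    || { echo "release $release not found in $namespace" >&2; return 1; }
  # Wait up to two minutes for every pod in the namespace to become Ready.
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout=120s >/dev/null \
    || { echo "pods in $namespace are not Ready" >&2; return 1; }
  kubectl get pods --namespace "$namespace"
}
```

Call it with the same `<release-name>` and `<namespace>` used in the `helm upgrade` command.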

Submit a job to verify the Anyscale operator installation:

anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py

Uninstall the Anyscale operator

To uninstall the Anyscale operator, run the following commands:

helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>

To delete the cloud, run the following command:

anyscale cloud delete --name <cloud-name>

Known limitations

Cloud deployments on Kubernetes do not support: