Anyscale operator for Kubernetes
The Anyscale operator for Kubernetes is in developer preview.
The Anyscale operator for Kubernetes enables deploying the Anyscale platform on Kubernetes clusters on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Container Engine for Kubernetes (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters running in the cloud or on-prem. See the diagram below for a high-level overview of the Anyscale operator:
View the resources that the Anyscale operator interacts with.
Namespaced resources
- Pods: each Anyscale / Ray node maps to a single pod.
- Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale services (user laptop -> Anyscale Service). Ingresses may be either private or public.
- Secrets: used to store sensitive values that the Anyscale operator needs.
- ConfigMaps: used to store configuration options for the Anyscale operator.
- Events: used to enhance workload observability.
Global resources
- TokenReview: When a pod for an Anyscale workload starts, Anyscale uses the Kubernetes TokenReview API to verify the pod's identity as it bootstraps itself to the Anyscale control plane (see the sketch after this list).
- Nodes: The operator periodically reads node information to enhance workload observability.
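For illustration, a TokenReview exchange looks roughly like the following (a minimal sketch of the Kubernetes TokenReview API; the exact request the operator issues is internal to Anyscale):

apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  # Service account token presented by the bootstrapping pod.
  token: <pod-service-account-token>
status:
  # Populated by the API server in the response.
  authenticated: true
  user:
    username: system:serviceaccount:<namespace>:anyscale-operator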
Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the necessary permissions to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside a vCluster in a namespace of your choice, as sketched below.
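A minimal sketch of the vCluster approach using the vcluster CLI (the virtual cluster name is illustrative, and recent vcluster versions connect automatically after create):

vcluster create anyscale-vcluster --namespace <namespace>
# Point kubectl and helm at the virtual cluster before installing the chart.
vcluster connect anyscale-vcluster --namespace <namespace>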
Deployment modes
Cloud-native mode (only supported for AWS and GCP)
Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.
Cloud-agnostic mode (supported for any Kubernetes cluster)
Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.
If running on EKS or GKE, use cloud-native mode when possible.
Prerequisites
- A Kubernetes cluster.
- Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
- Permissions to deploy a Helm chart into the Kubernetes cluster.
- The name of the Kubernetes namespace where you want to deploy the Anyscale operator.
- An ingress controller. Use the Ingress-NGINX controller when possible. Other ingress controllers may work as well, but aren't fully tested.
- For direct networking, configure an internet-facing load balancer.
- For customer-defined networking, configure an internal load balancer.
- In some cases, you can apply an annotation to the LoadBalancer service in front of the NGINX pods to configure internal load balancing (see the sketch after this list).
- As a reference, see the Anyscale documentation on the difference between direct and customer-defined networking modes on the AWS VM stack, including the pros and cons of each approach.
- An IP or hostname that resolves to your ingress.
- For public clouds, this should be a public IP or hostname that resolves to a public IP.
- For private clouds, this should be a private IP or hostname that resolves to a private IP.
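As an example of the internal load balancing annotation mentioned above, the following sketch marks the LoadBalancer service in front of an Ingress-NGINX install as internal on EKS with the AWS Load Balancer Controller (the service name and annotation are assumptions that vary by cloud provider and controller version):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller  # name from a standard Ingress-NGINX install
  namespace: ingress-nginx
  annotations:
    # AWS Load Balancer Controller: provision an internal (VPC-only) load balancer.
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer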
Cloud-native, AWS
- An S3 bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- An IAM role for the Anyscale operator to use, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
Cloud-native, GCP
- A GCS bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- The project ID of the Google Project that contains the target Kubernetes cluster.
- A service account for the Anyscale operator, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
Cloud-agnostic
- A cloud storage bucket (optional, highly recommended). Supported storage buckets include Google Cloud Storage buckets, S3 buckets, or S3-compatible buckets (`s3://<bucket-name>` or `gs://<bucket-name>`).
  - Anyscale uses this cloud storage bucket for persisting various system artifacts in the customer account, including runtime environment uploads from `anyscale job` and `anyscale service` CLI commands.
  - If desired, an endpoint URL to override the default `AWS_ENDPOINT_URL`.
- An NFS mount target (optional, highly recommended).
- Anyscale uses NFS for Anyscale Workspaces persistence, as well as cluster shared storage.
- If desired, a path to pass into the NFS volume specification.
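For reference, these two values map onto a standard Kubernetes NFS volume specification roughly as follows (a sketch; the operator generates the actual volume spec, and the volume name here is illustrative):

volumes:
  - name: anyscale-shared-storage
    nfs:
      server: <nfs-mount-target>  # from --nfs-mount-target
      path: <nfs-mount-path>      # from --nfs-mount-path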
NOTE: Some Anyscale features, such as log viewing through the UI, aren't supported in cloud-agnostic mode at this time.
Deployment
Add the Anyscale Helm chart repository
helm repo add anyscale https://anyscale.github.io/helm-charts
helm repo update anyscale
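To confirm that the repository was added and the chart is visible, you can run:

helm search repo anyscale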
Then, sign in to your Anyscale account using `anyscale login`, and proceed with the following steps:
Cloud-native, AWS
anyscale cloud register --name <cloud-name> \
--provider aws \
--region <region> \
--compute-stack k8s \
--kubernetes-zones <comma-separated-zones> \
--anyscale-operator-iam-identity <anyscale-operator-iam-role-arn> \
--s3-bucket-id <s3-bucket-arn> \
--efs-id <efs-id>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=aws \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
Cloud-native, GCP
# Note: --project-id is only required if using NFS. --vpc-name is used to
# discover NFS mount targets from the Filestore instance specified below.
anyscale cloud register --name <cloud-name> \
--provider gcp \
--region <region> \
--compute-stack k8s \
--kubernetes-zones <comma-separated-zones> \
--anyscale-operator-iam-identity <anyscale-operator-service-account-email> \
--cloud-storage-bucket-name <cloud-storage-bucket-name> \
--project-id <project-id> \
--vpc-name <vpc-name> \
--filestore-instance-id <filestore-instance-id> \
--filestore-location <filestore-location>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=gcp \
--set-string region=<region> \
--set-string operatorIamIdentity=<anyscale-operator-service-account-email> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
On GKE, also grant the operator's Kubernetes service account permission to impersonate the Google service account through Workload Identity:
gcloud iam service-accounts add-iam-policy-binding <anyscale-operator-service-account-email> \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:<project-id>.svc.id.goog[<namespace>/anyscale-operator]"
Cloud-agnostic
anyscale cloud register --name <cloud-name> \
--provider generic \
--compute-stack k8s \
--cloud-storage-bucket-name <(s3:// or gcs://)> \
--cloud-storage-bucket-endpoint <(https://object.lga1.coreweave.com/, for example)> \
--nfs-mount-target <(passed to the "server" attr. of the NFS volume spec)> \
--nfs-mount-path <(passed to the "path" attr. of the NFS volume spec)>
# Acquire an ANYSCALE_CLI_TOKEN from the Anyscale console, and set it as an environment variable.
export ANYSCALE_CLI_TOKEN=<cli-token>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=generic \
--set-string anyscaleCliToken=$ANYSCALE_CLI_TOKEN \
--namespace <namespace> \
--create-namespace \
-i
At this point, the Anyscale operator should come up and start posting health checks to the Anyscale control plane. You should be ready to run workloads as you normally would on Anyscale clouds.
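For example, you can confirm that the operator pod is running with:

kubectl get pods -n <namespace>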
Submit a test job to verify the Anyscale operator installation:
anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py
Configuration options
End-users of Anyscale features (data scientists, ML engineers, etc.) submit workloads to Anyscale by defining compute configs, which give them control over the instance types and shapes that their applications require. As an example, consider the following compute config for a Ray workload that requires some CPU workers and some A10G GPU workers on AWS:
cloud: aws-cloud
zones:
  - us-west-2a
  - us-west-2b
head_node:
  instance_type: m5.8xlarge
worker_nodes:
  - instance_type: m5.8xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: PREFER_SPOT
  - instance_type: g5.4xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: ON_DEMAND
The Anyscale operator supports all of these features (zone selection, instance type selection, market type selection), but requires customization to integrate with cluster-specific properties. Many of these properties are set through Helm chart options.
Instance Type ConfigMap
When running on top of Kubernetes, an Anyscale "instance type" maps to a Pod shape. The cloud administrator defines instance types when setting up the Anyscale operator, either through the Helm chart options or out-of-band by editing the `instance-types` ConfigMap that the Helm chart creates.
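If you define instance types through the Helm chart options, the values may look like the following sketch (the `instanceTypes` key is an assumption; check the chart's values.yaml for the exact key and schema):

instanceTypes:
  2CPU-8GB:
    resources:
      CPU: 2
      memory: 8Gi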
Here's an example of what the generated ConfigMap may look like:
(base) [~]$ kubectl get configmap instance-types -o yaml
apiVersion: v1
data:
  instance_types.yaml: |-
    # A small CPU-only shape.
    2CPU-8GB:
      resources:
        CPU: 2
        memory: 8Gi
    # A larger shape with both CPU and GPU.
    8CPU-32GB-1xT4:
      resources:
        CPU: 8
        GPU: 1
        accelerator_type:T4: 1
        memory: 32Gi
  version: v1
`2CPU-8GB` and `8CPU-32GB-1xT4` are names that follow an Anyscale naming convention. Cloud administrators may use a naming convention of their choice; valid characters include alphanumeric characters, dashes, and underscores.
Each instance type defined in the ConfigMap is visible in the Anyscale UI through a drop-down list. Users can select these instance types when submitting workloads. Users may also define compute configs that use these instance types through the Anyscale CLI/SDK.
The Anyscale console is updated roughly every 30 seconds with the latest instance types defined in the ConfigMap.
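To update instance types out-of-band after installation, you can edit the ConfigMap directly, for example:

kubectl edit configmap instance-types -n <namespace>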
For accelerators, the `accelerator_type` value should map to the list of Ray-supported accelerators. If an accelerator type isn't defined in this list, open an issue on the Ray GitHub repository, and forward it to Anyscale support.
When the Anyscale operator applies a pod spec to Kubernetes for an Anyscale workload, the operator uses the shapes defined in the Instance Type ConfigMap as an upper bound for the sum of all memory requests and limits across all containers in the pod. Anyscale reserves some memory and CPU for critical-path Anyscale sidecar containers, and provides the rest to the Ray container to run the primary workload.
Advanced: Patch ConfigMap
Different Kubernetes clusters vary in how they handle spot capacity, accelerators, and other cluster-specific concerns.
The Patch API provides an escape hatch to handle custom integrations. This API allows for just-in-time patching of all Anyscale-managed resources as they're applied to the Kubernetes cluster. The Patch API uses JSON Patch syntax (IETF RFC 6902). As an example, consider the patch below:
patches:
  - kind: Pod
    # See: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
    selector: "anyscale.com/market-type in (ON_DEMAND)"
    # See: https://jsonpatch.com/
    patch:
      - op: add
        path: /spec/nodeSelector/eks.amazonaws.com~1capacityType # use ~1 to escape the forward-slash
        value: "ON_DEMAND"
The Anyscale operator applies this set of patches to every Pod it creates that matches the Kubernetes selector. In this case, the operator adds the `eks.amazonaws.com/capacityType` node selector to the Pod spec.
The Helm chart generates a variety of patches using the default configuration options that should work on EKS or GKE out-of-the-box without additional configuration. The Helm chart also accepts additional patches to support custom autoscalers, ingresses, or other cluster-specific properties.
View example patches
These patches may require slight modifications to work with your Kubernetes cluster setup, because versions of downstream resources may have changed since the time we wrote these patches. Use them as a starting point for using different types of downstream resources.
Using the AWS Load Balancer Controller
First, create a temporary minimal ingress with a fixed group name, such as:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    alb.ingress.kubernetes.io/group.name: anyscale
spec:
  rules:
    - http:
        paths:
          - path: /testpath
            pathType: Prefix
            backend:
              service:
                name: test
                port:
                  number: 80
Describe the ingress (`kubectl describe ingress minimal-ingress`) to retrieve the `status.loadBalancer.ingress.hostname` attribute. It should resemble `anyscale-1421684687.us-west-2.elb.amazonaws.com`, an address that stays consistent for all ingresses you create with this group name.
Use this value for the `--kubernetes-ingress-external-address` flag during cloud registration.
After you retrieve the address, delete the temporary ingress.
Then, apply this set of additional patches.
additionalPatches:
  # Apply these patches to all `Ingress` resources created by Anyscale.
  - kind: Ingress
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1group.name
        value: "anyscale"
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1load-balancer-name
        value: "anyscale"
      # Uncomment this if you want to use an internal ALB, which is only accessible
      # from inside the VPC and requires a VPN to access from a laptop.
      # - op: add
      #   path: /metadata/annotations/alb.ingress.kubernetes.io~1scheme
      #   value: "internal"
  # NOTE: The rest of the patches are only required for Anyscale services functionality.
  # They aren't required for basic head node connectivity for workspaces and jobs.
  # When the Anyscale operator performs a service rollout, two ingress resources are
  # created for the NGINX Ingress Controller. The first ingress is the primary ingress,
  # and the second ingress is the canary ingress.
  # The Anyscale operator uses the following patches to convert from the NGINX Ingress Controller
  # scheme to the ALB Ingress Controller scheme, handling the case where a single ingress and service
  # exist and the case where a rollout is in progress and two ingresses and services exist.
  # The ALB Ingress Controller doesn't need two ingresses to manage canary deployments.
  # Instead, the ALB Ingress Controller can manage canary deployments through a single
  # ingress resource. This patch modifies the ingress resource created by the Anyscale
  # operator to use the ALB Ingress Controller scheme, by updating the annotations on
  # the primary ingress when a canary exists.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (true)"
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
        value: >
          {
            "type": "forward",
            "forwardConfig": {
              "targetGroups": [
                {
                  "serviceName": "{{.PrimarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.PrimaryWeight}}
                },
                {
                  "serviceName": "{{.CanarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.CanaryWeight}}
                }
              ],
              "targetGroupStickinessConfig": {
                "enabled": false
              }
            }
          }
      # Update the serviceName and servicePort to point to the action name
      # so that the rules in the annotation are used. For more information,
      # see:
      # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/name
        value: anyscale
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/port/name
        value: use-annotation
  # This patch handles the primary ingress when a canary doesn't exist (for
  # example, when a service rollout isn't in progress) by adding a set of actions
  # to forward traffic to the primary service.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (false)"
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
        value: >
          {
            "type": "forward",
            "forwardConfig": {
              "targetGroups": [
                {
                  "serviceName": "{{.PrimarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.PrimaryWeight}}
                }
              ],
              "targetGroupStickinessConfig": {
                "enabled": false
              }
            }
          }
      # Update the serviceName and servicePort to point to the action name
      # so that the rules in the annotation are used. For more information,
      # see:
      # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/name
        value: anyscale
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/port/name
        value: use-annotation
  # This patch handles the canary ingress by rewriting it into a no-op ingress. The ALB Ingress
  # Controller doesn't need a separate ingress for canary deployments, so this patch no-ops it.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (canary)"
    patch:
      - op: replace
        path: /spec
        value:
          defaultBackend:
            service:
              name: default-backend
              port:
                number: 80
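One way to apply these patches is to save them in a values file and pass it to Helm while leaving your other settings untouched (assuming the chart reads them from the `additionalPatches` key shown above):

helm upgrade <release-name> anyscale/anyscale-operator \
--values alb-patches.yaml \
--reuse-values \
--namespace <namespace>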
View all annotations provided by Anyscale that you can use for custom patches
These annotations are applied by the Anyscale control plane on resources created by Anyscale.
| Label name | Possible label values | Description |
| --- | --- | --- |
| `anyscale.com/market-type` | `SPOT`, `ON_DEMAND` | Users with workloads that support preemption may opt to run their workloads on spot node types through the compute config. All other workloads run on on-demand node types. This label should most likely be transformed into a node affinity (see the sketch after this table). |
| `anyscale.com/zone` | user-defined through cloud setup | For Pods that have a specific zone affinity, the Anyscale operator sets this label to the zone that the Pod should launch into (`us-west-2a`, for example). Zones are provided as `[]string` at cloud registration time and can be selected from the Anyscale UI. This label should most likely be transformed into a node affinity. |
| `anyscale.com/accelerator-type` | user-defined through instance type configuration | When requesting a GPU Pod, the Anyscale operator sets this label to one of the Anyscale accelerator types. |
| `anyscale.com/instance-type` | user-defined through instance type configuration | The operator sets this value for all Pods created through Anyscale. |
| `anyscale.com/canary-weight`, `anyscale.com/canary-exists`, `anyscale.com/canary-svc`, `anyscale.com/ingress-type`, `anyscale.com/bearer-token`, `anyscale.com/primary-weight`, `anyscale.com/primary-svc` | various | For advanced use only (when using an ingress other than NGINX for inference/serving workloads with Anyscale Services). Contact Anyscale for more details. |
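As an example of transforming the market-type label into a scheduling constraint, on GKE you might steer SPOT workloads onto spot node pools with a patch like the following (a sketch; `cloud.google.com/gke-spot` is the standard GKE spot node label, but verify that it matches your node pools):

additionalPatches:
  - kind: Pod
    selector: "anyscale.com/market-type in (SPOT)"
    patch:
      - op: add
        path: /spec/nodeSelector/cloud.google.com~1gke-spot
        value: "true"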
Advanced: Compute configuration options
The compute configuration allows you to provide workload-scoped advanced configuration settings for either a specific node or the entire cluster. Anyscale applies this configuration as a strategic merge patch to the Pod specifications generated by Anyscale before sending them to the Kubernetes API.
Node-specific configurations override any cluster-wide configurations for that node type. For a full reference on configuring these properties, see the official Kubernetes documentation.
View a sample of common advanced configuration options.
{
  "metadata": {
    // Add a new label.
    "labels": {"new-label": "example-value"},
    // Add a new annotation.
    "annotations": {"new-annotation": "example-value"}
  },
  "spec": {
    // Add a node selector.
    "nodeSelector": {"disktype": "ssd"},
    "tolerations": [{
      "effect": "NoSchedule",
      "key": "dedicated",
      "value": "example-anyscale"
    }],
    "containers": [{
      // Add a PersistentVolumeClaim to the Ray container.
      "name": "ray",
      "volumeMounts": [{
        "name": "pvc-volume",
        "mountPath": "/mnt/pvc-data"
      }]
    },{
      // Add a sidecar for exporting logs/metrics.
      "name": "monitoring-sidecar",
      "image": "timberio/vector:latest",
      "ports": [{
        "containerPort": 9000
      }],
      "volumeMounts": [{
        "name": "vector-volume",
        "mountPath": "/mnt/vector-data"
      }]
    }],
    "volumes": [{
      "name": "pvc-volume",
      "persistentVolumeClaim": {
        "claimName": "my-pvc"
      }
    },{
      "name": "vector-volume",
      "emptyDir": {}
    }]
  }
}
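To attach such a configuration to a workload, place it in the compute config. A sketch, assuming the field is named `advanced_instance_config` as on other Anyscale stacks (check the compute config reference for your Anyscale version):

cloud: my-k8s-cloud
head_node:
  instance_type: 2CPU-8GB
  advanced_instance_config:  # assumed field name
    metadata:
      labels:
        new-label: example-value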
Uninstall the Anyscale operator
View uninstallation instructions
To uninstall the Anyscale operator, run the following commands:
helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>
To delete the cloud, run the following command:
anyscale cloud delete --name <cloud-name>