Anyscale operator for Kubernetes
The Anyscale operator for Kubernetes is in developer preview.
The Anyscale operator for Kubernetes enables deploying the Anyscale platform on Kubernetes clusters on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Container Engine for Kubernetes (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters running in the cloud or on-prem. See the diagram below for a high-level overview of the Anyscale operator:
View the resources that the Anyscale operator manages.
Namespaced resources
- Pods: each Anyscale / Ray node maps to a single pod.
- Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale services (user laptop -> Anyscale Service). Ingresses may be either private or public.
- Secrets: used to hold credentials and other sensitive data used by the Anyscale operator.
- ConfigMaps: used to store configuration options for the Anyscale operator.
- Events: used to enhance workload observability.
Global resources
- TokenReview: When an Anyscale node in an Anyscale workload starts, the operator uses the Kubernetes TokenReview API to verify the pod's identity as the pod bootstraps itself to the Anyscale control plane.
- Nodes: The operator periodically reads node information to enhance workload observability.
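As a rough sketch, the cluster-scoped permissions described above correspond to RBAC rules like the following. This is illustrative only; the authoritative ClusterRole ships with the Helm chart, and the name here is hypothetical:

```yaml
# Illustrative sketch of the cluster-scoped RBAC the operator needs.
# The authoritative ClusterRole is defined by the Helm chart; the
# name below is hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: anyscale-operator-global  # hypothetical name
rules:
  # Verify pod identity at bootstrap through the TokenReview API.
  - apiGroups: ["authentication.k8s.io"]
    resources: ["tokenreviews"]
    verbs: ["create"]
  # Read node information to enhance workload observability.
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
```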
Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the necessary permissions to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside of vCluster in a namespace of your choice.
Deployment modes
Cloud-native mode (only supported for AWS and GCP)
Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.
Cloud-agnostic mode (supported for any Kubernetes cluster)
Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.
If running on EKS or GKE, use cloud-native mode when possible.
Prerequisites
- A Kubernetes cluster.
- Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
- Permissions to deploy a Helm chart into the Kubernetes cluster.
- The name of the Kubernetes namespace that you would like to deploy the Anyscale operator inside of.
- An ingress controller. Use the Ingress-NGINX controller when possible. Other ingress controllers may work as well, but aren't fully tested.
- For direct networking, configure an internet-facing load balancer.
- For customer-defined networking, configure an internal load balancer.
- In some cases, an annotation on the LoadBalancer service in front of the NGINX pods can be applied to configure internal load balancing.
- As a reference, see this link for the difference between direct and customer-defined networking modes on the AWS VM stack, and the pros and cons of each approach.
- An IP or hostname that resolves to your ingress.
- For public clouds, this should be a public IP or hostname that resolves to a public IP.
- For private clouds, this should be a private IP or hostname that resolves to a private IP.
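For example, with the Ingress-NGINX controller on EKS, internal load balancing can typically be configured with a scheme annotation on the controller's LoadBalancer Service. Annotation keys vary by cloud provider and controller version, so treat this as a sketch:

```yaml
# Sketch: annotate the ingress-nginx controller Service for an
# internal (customer-defined networking) load balancer on EKS.
# Annotation keys differ per cloud provider and controller version.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
spec:
  type: LoadBalancer
```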
- Cloud-native, AWS
- Cloud-native, GCP
- Cloud-agnostic
- An S3 bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- An IAM role for Anyscale to assume to generate presigned URLs to the S3 bucket (this is how Anyscale provides log viewing capabilities through the Anyscale console, as well as log download features).
- An IAM role for the Anyscale operator to use, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
- A GCS bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- A Service Account for Anyscale to use through Workload Identity Federation to generate presigned URLs to the storage bucket (this is how Anyscale provides log viewing capabilities through the Anyscale console, as well as log download features).
- A Workload Identity Pool Provider that grants the Anyscale Control Plane access to the preceding Service Account.
- The project ID of the Google Project that contains the preceding resources.
- A service account for the Anyscale operator, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
- A cloud storage bucket (optional, highly recommended). Supported storage buckets include Google Cloud Storage buckets, S3 buckets, or S3-compatible buckets (`s3://<bucket-name>` or `gs://<bucket-name>`).
  - Anyscale uses this cloud storage bucket for persisting various system artifacts in the customer account, including runtime environment uploads from `anyscale job` and `anyscale service` CLI commands.
  - If desired, an endpoint URL to override the default `AWS_ENDPOINT_URL`.
- An NFS mount target (optional, highly recommended).
- Anyscale uses NFS for Anyscale Workspaces persistence, as well as cluster shared storage.
- If desired, a path to pass into the NFS volume specification.
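The NFS mount target and path map onto a standard Kubernetes NFS volume specification, roughly like this (all values below are placeholders):

```yaml
# Sketch: how --nfs-mount-target and --nfs-mount-path map onto a
# Kubernetes NFS volume specification. Values are placeholders.
volumes:
  - name: anyscale-shared-storage  # hypothetical volume name
    nfs:
      server: 10.0.0.10            # from --nfs-mount-target
      path: /shared                # from --nfs-mount-path
```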
NOTE: Some Anyscale features, such as log viewing through the UI, aren't supported in cloud-agnostic mode at this time.
Deployment
Download the Helm chart and save it to a local directory.
Then, sign in to your Anyscale account using `anyscale login`, and proceed with the following steps:
- Cloud-native, AWS
- Cloud-native, GCP
- Cloud-agnostic
anyscale cloud register --name <cloud-name> \
--provider aws \
--region <region> \
--compute-stack k8s \
--kubernetes-namespaces <namespace> \
--kubernetes-ingress-external-address <kubernetes-ingress-external-address-or-ip> \
--kubernetes-zones <comma-separated-zones> \
--kubernetes-dataplane-identity <data-plane-iam-role-arn> \
--anyscale-iam-role-id <control-plane-iam-role-arn> \
--s3-bucket-id <s3-bucket-arn> \
--efs-id <efs-id>
helm upgrade <release-name> ./chart \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=aws \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
anyscale cloud register --name <cloud-name> \
--provider gcp \
--region <region> \
--compute-stack k8s \
--kubernetes-namespaces <namespace> \
--kubernetes-ingress-external-address <kubernetes-ingress-external-address-or-ip> \
--kubernetes-zones <comma-separated-zones> \
--kubernetes-dataplane-identity <data-plane-service-account-email> \
--project-id <project-id> \
--anyscale-service-account-email <service-account-email> \
--provider-name <provider-name> \
--cloud-storage-bucket-name <cloud-storage-bucket-name> \
--vpc-name <vpc-name> \ # used to discover NFS mount targets from the Filestore instance below
--filestore-instance-id <filestore-instance-id> \
--filestore-location <filestore-location>
helm upgrade <release-name> ./chart \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=gcp \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
gcloud iam service-accounts add-iam-policy-binding <data-plane-service-account-email> \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:<project-id>.svc.id.goog[<namespace>/anyscale-operator]"
kubectl annotate serviceaccount anyscale-operator --namespace <namespace> iam.gke.io/gcp-service-account=<dataplane-service-account-email>
anyscale cloud register --name <cloud-name> \
--provider generic \
--compute-stack k8s \
--kubernetes-namespaces <namespace> \
--kubernetes-ingress-external-address <(public or private IP/hostname resolving to ingress)> \
--cloud-storage-bucket-name <(s3:// or gcs://)> \
--cloud-storage-bucket-endpoint <(https://object.lga1.coreweave.com/, for example)> \
--nfs-mount-target <(passed to the "server" attr. of the NFS volume spec)> \
--nfs-mount-path <(passed to the "path" attr. of the NFS volume spec)>
# Acquire an ANYSCALE_CLI_TOKEN from the Anyscale console, and set it as an environment variable.
export ANYSCALE_CLI_TOKEN=<cli-token>
helm upgrade <release-name> ./chart \
--set-string cloudDeploymentId=<cloudDeploymentId> \
--set-string cloudProvider=generic \
--set-string anyscaleCliToken=$ANYSCALE_CLI_TOKEN \
--namespace <namespace> \
--create-namespace \
-i
At this point, the Anyscale operator should come up and start posting health checks to the Anyscale Control Plane. You should be ready to run workloads as you normally would on Anyscale clouds.
Try to submit a job to verify the Anyscale operator installation:
anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py
Configuration options
End users of Anyscale features (data scientists, ML engineers, etc.) submit workloads to Anyscale by defining compute configs, which give them control over the instance types and shapes that their applications require. As an example, consider the following compute configuration for a Ray workload that requires some CPU workers and some A10G workers on AWS:
cloud: aws-cloud
zones:
- us-west-2a
- us-west-2b
head_node:
instance_type: m5.8xlarge
worker_nodes:
- instance_type: m5.8xlarge
min_nodes: 0
max_nodes: 5
market_type: PREFER_SPOT
- instance_type: g5.4xlarge
min_nodes: 0
max_nodes: 5
market_type: ON_DEMAND
The Anyscale operator supports all of these features (zone selection, instance type selection, market type selection), but requires customization to integrate with cluster-specific properties. Many of these properties are set through Helm chart options.
Instance Type ConfigMap
When running on top of Kubernetes, an Anyscale "instance type" maps to a Pod shape. The cloud administrator defines instance types when setting up the Anyscale operator, either through the Helm chart options or out-of-band by editing the `instance-types` ConfigMap that the Helm chart creates.
Here is an example of what the generated ConfigMap may look like:
(base) [~]$ kubectl get configmap instance-types -o yaml
apiVersion: v1
data:
instance_types.yaml: |-
# A small CPU-only shape.
2CPU-8GB:
resources:
CPU: 2
memory: 8Gi
# A larger shape with both CPU and GPU.
8CPU-32GB-1xT4:
resources:
CPU: 8
GPU: 1
accelerator_type:T4: 1
memory: 32Gi
version: v1
`2CPU-8GB` and `8CPU-32GB-1xT4` are names that follow an Anyscale naming convention. Cloud administrators may use a naming convention of their choice; valid characters include alphanumeric characters, dashes, and underscores.
Each instance type defined in the ConfigMap is visible in the Anyscale UI through a drop-down list. Users can select these instance types when submitting workloads. Users may also define compute configs that use these instance types through the Anyscale CLI/SDK.
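For example, a compute config written against the instance types above might look like the following sketch, where the cloud name and node counts are placeholders:

```yaml
# Sketch: a compute config referencing the custom instance type
# names defined in the instance-types ConfigMap above.
# The cloud name and node counts are placeholders.
cloud: my-k8s-cloud
head_node:
  instance_type: 2CPU-8GB
worker_nodes:
  - instance_type: 8CPU-32GB-1xT4
    min_nodes: 0
    max_nodes: 4
```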
The Anyscale console refreshes approximately every 30 seconds with the latest instance types defined in the ConfigMap.
For accelerators, the `accelerator_type` value should map to the list of Ray-supported accelerators. If an accelerator type isn't defined in this list, open an issue on the Ray GitHub repository and forward it to Anyscale support.
When the Anyscale operator applies a pod spec to Kubernetes for an Anyscale workload, the operator uses the shapes defined in the Instance Type ConfigMap as an upper bound for the sum of all of the memory requests & limits across all containers in the pod. Anyscale reserves some memory / CPU for critical-path Anyscale sidecar containers, and provides the rest to the Ray container to run the primary workload.
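To make the sizing rule concrete, here is a purely illustrative sketch in Python. The actual reservation amounts are internal to Anyscale, so the `SIDECAR_*` values below are hypothetical placeholders; only the shape of the calculation (instance shape minus sidecar reservation equals Ray container budget) reflects the text above.

```python
# Purely illustrative sketch of the operator's container sizing
# logic. The real reservation values are internal to Anyscale;
# these numbers are hypothetical placeholders.
SIDECAR_MEMORY_MI = 512   # hypothetical sidecar memory reservation (Mi)
SIDECAR_CPU_MILLI = 250   # hypothetical sidecar CPU reservation (millicores)

def ray_container_resources(instance_memory_mi: int,
                            instance_cpu_milli: int) -> dict:
    """Split an instance type's resources between Anyscale sidecars
    and the Ray container, so the pod total stays within the shape
    defined in the Instance Type ConfigMap."""
    return {
        "memory_mi": instance_memory_mi - SIDECAR_MEMORY_MI,
        "cpu_milli": instance_cpu_milli - SIDECAR_CPU_MILLI,
    }

# For a 2CPU-8GB instance type (8192 Mi, 2000 millicores):
print(ray_container_resources(8192, 2000))
# → {'memory_mi': 7680, 'cpu_milli': 1750}
```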
Advanced: Patch ConfigMap
Different Kubernetes clusters vary in how they handle spot instances, accelerators, and other cluster-specific concerns.
The Patch API provides an escape hatch to handle custom integrations. This API allows for just-in-time patching of all Anyscale-managed resources as they're applied to the Kubernetes cluster. The Patch API uses JSON Patch syntax (IETF RFC 6902). As an example, consider the patch below:
patches:
- kind: Pod
# See: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
selector: "anyscale.com/market-type in (ON_DEMAND)"
# See: https://jsonpatch.com/
patch:
- op: add
path: /spec/nodeSelector/eks.amazonaws.com~1capacityType # use ~1 to escape the forward-slash
value: "ON_DEMAND"
The operator applies each set of patches to every Pod it creates that matches the Kubernetes selector. In this case, the operator adds the `eks.amazonaws.com/capacityType` node selector to the Pod spec.
The Helm chart generates a variety of patches using the default configuration options that should work on EKS/GKE out-of-the-box without additional configuration. Additional patches to support custom autoscalers, ingresses, or other cluster-specific properties may be provided through the Helm chart.
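For example, on GKE a similar patch could steer spot-market Pods onto Spot VMs. This is a sketch; verify the label and taint against your own cluster's node pools:

```yaml
# Sketch: route Anyscale SPOT workloads onto GKE Spot VMs.
# Verify the node label and taint against your own node pools.
patches:
  - kind: Pod
    selector: "anyscale.com/market-type in (SPOT)"
    patch:
      - op: add
        path: /spec/nodeSelector/cloud.google.com~1gke-spot  # ~1 escapes "/"
        value: "true"
      - op: add
        path: /spec/tolerations/-
        value:
          key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
```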
View all labels that can be used for selection / patching
| Label name | Possible label values | Description |
|---|---|---|
| `anyscale.com/market-type` | `SPOT`, `ON_DEMAND` | Users with workloads that support preemption may opt to run their workloads on spot node types through the compute config. All other workloads run on on-demand node types. This should most likely be transformed into a node affinity. |
| `anyscale.com/zone` | User-defined through cloud setup | For Pods that have a specific zone affinity, the Anyscale operator sets this label to the zone that the Pod should launch into (`us-west-2a`, for example). Zones are provided as `[]string` at cloud registration time and can be selected from the Anyscale UI. This should most likely be transformed into a node affinity. |
| `anyscale.com/accelerator-type` | User-defined through instance type configuration | When requesting a GPU Pod, the Anyscale operator sets this label to one of the Anyscale accelerator types. |
| `anyscale.com/instance-type` | User-defined through instance type configuration | The operator sets this value for all Pods created through Anyscale. |
| `anyscale.com/canary-weight`, `anyscale.com/canary-exists`, `anyscale.com/canary-svc`, `anyscale.com/ingress-type`, `anyscale.com/bearer-token`, `anyscale.com/primary-weight`, `anyscale.com/primary-svc` | Various | For advanced use only (when using an ingress other than NGINX for inference/serving workloads with Anyscale Services). Contact Anyscale for more details. |
Uninstall the Anyscale operator
View uninstallation instructions
To uninstall the Anyscale operator, run the following commands:
helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>
To delete the cloud, run the following command:
anyscale cloud delete --name <cloud-name>