Anyscale operator for Kubernetes

info

The Anyscale operator for Kubernetes is in developer preview.

The Anyscale operator for Kubernetes enables deploying the Anyscale platform on Kubernetes clusters on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Kubernetes Engine (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters running in the cloud or on-prem. See the diagram below for a high-level overview of the Anyscale operator:

View resources the Anyscale operator operates on.

Namespaced resources

  • Pods: each Anyscale / Ray node maps to a single pod.
  • Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale services (user laptop -> Anyscale Service). Ingresses may be either private or public.
  • Secrets: used to hold secrets used by the Anyscale operator.
  • ConfigMaps: used to store configuration options for the Anyscale operator.
  • Events: used to enhance workload observability.

Global resources

  • TokenReview: When an Anyscale node starts up in an Anyscale workload, Anyscale uses the Kubernetes TokenReview API to verify the pod's identity as the pod bootstraps itself to the Anyscale control plane.
  • Nodes: The operator periodically reads node information to enhance workload observability.
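
To make the TokenReview step more concrete, the following sketch shows the general shape of a TokenReview request body and one way a verifier might read the response. It illustrates the Kubernetes API only, not Anyscale's implementation; the helper names, the `anyscale` namespace, and the sample response are hypothetical.

```python
import json


def build_token_review(token: str) -> dict:
    """Build a TokenReview request body for the authentication.k8s.io/v1 API."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "spec": {"token": token},
    }


def service_account_from_review(review: dict):
    """Return (namespace, name) for an authenticated service account token, else None."""
    status = review.get("status", {})
    if not status.get("authenticated"):
        return None
    username = status.get("user", {}).get("username", "")
    parts = username.split(":")
    # Service account usernames have the form system:serviceaccount:<namespace>:<name>.
    if len(parts) == 4 and parts[:2] == ["system", "serviceaccount"]:
        return parts[2], parts[3]
    return None


# Hypothetical API server response, for illustration only.
example_response = {
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "TokenReview",
    "status": {
        "authenticated": True,
        "user": {"username": "system:serviceaccount:anyscale:anyscale-operator"},
    },
}

print(json.dumps(build_token_review("<pod-service-account-token>"), indent=2))
print(service_account_from_review(example_response))
```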

Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the necessary permissions to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside of vCluster in a namespace of your choice.

Deployment modes

Cloud-native mode (only supported for AWS and GCP)

Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.

Cloud-agnostic mode (supported for any Kubernetes cluster)

Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.

tip

If running on EKS or GKE, use cloud-native mode when possible.

Prerequisites

  • A Kubernetes cluster.
    • Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
  • Permissions to deploy a Helm chart into the Kubernetes cluster.
  • The name of the Kubernetes namespace in which to deploy the Anyscale operator.
  • An ingress controller. Use the Ingress-NGINX controller when possible. Other ingress controllers may work as well, but aren't fully tested.
    • For direct networking, configure an internet-facing load balancer.
    • For customer-defined networking, configure an internal load balancer.
      • In some cases, you can configure internal load balancing by applying an annotation to the LoadBalancer service in front of the NGINX pods.
    • For reference, see the comparison of direct and customer-defined networking modes on the AWS VM stack, including the pros and cons of each approach.
  • An IP or hostname that resolves to your ingress.
    • For public clouds, this should be a public IP or hostname that resolves to a public IP.
    • For private clouds, this should be a private IP or hostname that resolves to a private IP.
  • An S3 bucket for system and artifact storage.
    • All Pods created by Anyscale must have direct access to this storage bucket.
  • An IAM role for Anyscale to assume to generate presigned URLs to the S3 bucket (this is how Anyscale provides log viewing capabilities through the Anyscale console, as well as log download features).
  • An IAM role for the Anyscale operator to use, for the purposes of verifying the operator identity.

See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
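
The Kubernetes version prerequisite above can be checked with a few lines of code. The sketch below assumes you already have the server version string (for example, the `serverVersion.gitVersion` field from `kubectl version -o json`); the function names and the sample version strings are hypothetical.

```python
import re

MIN_VERSION = (1, 28)  # minimum tested version, from the prerequisites above


def parse_k8s_version(git_version: str) -> tuple:
    """Extract (major, minor) from a version string such as 'v1.29.4-eks-036c24b'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def is_tested_version(git_version: str) -> bool:
    """Return True if the cluster version is at or above the tested minimum."""
    return parse_k8s_version(git_version) >= MIN_VERSION


# Sample version strings, made up for illustration.
for version in ["v1.27.9", "v1.28.0", "v1.30.1-gke.100"]:
    print(version, is_tested_version(version))
```

A result of False doesn't mean the operator won't work; per the prerequisites, earlier versions may work but aren't fully tested.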

Deployment

Download the Helm chart and save it to a local directory.

Then, sign in to your Anyscale account using anyscale login, and proceed with the following steps:

anyscale cloud register --name <cloud-name> \
--provider aws \
--region <region> \
--compute-stack k8s \
--kubernetes-namespaces <namespace> \
--kubernetes-ingress-external-address <kubernetes-ingress-external-address-or-ip> \
--kubernetes-zones <comma-separated-zones> \
--kubernetes-dataplane-identity <data-plane-iam-role-arn> \
--anyscale-iam-role-id <control-plane-iam-role-arn> \
--s3-bucket-id <s3-bucket-arn> \
--efs-id <efs-id>

helm upgrade <release-name> ./chart \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=aws \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i

At this point, the Anyscale operator should come up and start posting health checks to the Anyscale control plane. You can now run workloads as you normally would on Anyscale clouds.

Try to submit a job to verify the Anyscale operator installation:

anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py

Configuration options

End users of Anyscale features (data scientists, ML engineers, and so on) submit workloads to Anyscale by defining compute configs, which give them control over the instance types and shapes their application requires. As an example, consider the following compute config for a Ray workload that requires some CPU workers and some A10G workers on AWS:

cloud: aws-cloud
zones:
  - us-west-2a
  - us-west-2b
head_node:
  instance_type: m5.8xlarge
worker_nodes:
  - instance_type: m5.8xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: PREFER_SPOT
  - instance_type: g5.4xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: ON_DEMAND

The Anyscale operator supports all of these features (zone selection, instance type selection, market type selection), but requires customization to integrate with cluster-specific properties. Many of these properties are set through Helm chart options.

Instance Type ConfigMap

When running on top of Kubernetes, an Anyscale "instance type" maps to a Pod shape. The cloud administrator defines instance types when setting up the Anyscale operator through either the Helm chart options or out-of-band by editing the instance-types ConfigMap that the Helm chart creates.

Here is an example of what the generated ConfigMap may look like:

$ kubectl get configmap instance-types -o yaml
apiVersion: v1
data:
  instance_types.yaml: |-
    # A small CPU-only shape.
    2CPU-8GB:
      resources:
        CPU: 2
        memory: 8Gi
    # A larger shape with both CPU and GPU.
    8CPU-32GB-1xT4:
      resources:
        CPU: 8
        GPU: 1
        accelerator_type:T4: 1
        memory: 32Gi
  version: v1

2CPU-8GB and 8CPU-32GB-1xT4 are names that follow an Anyscale naming convention. Cloud administrators may use a naming convention of their choice. Valid characters include alphanumeric characters, dashes, and underscores.
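
As a quick illustration of the naming rule, the sketch below validates candidate instance type names before they go into the ConfigMap. It assumes that ASCII letters, digits, dashes, and underscores are the only allowed characters and that there is no length limit, neither of which this document states precisely; the function name is hypothetical.

```python
import re

# Allowed characters per the text above: alphanumerics, dashes, and underscores.
# Restricting "alphanumeric" to ASCII is an assumption.
INSTANCE_TYPE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_instance_type_name(name: str) -> bool:
    """Return True if the whole name consists only of allowed characters."""
    return INSTANCE_TYPE_NAME.fullmatch(name) is not None


for name in ["2CPU-8GB", "8CPU-32GB-1xT4", "gpu_large", "8 CPU", "a10g/large"]:
    print(f"{name!r}: {is_valid_instance_type_name(name)}")
```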

Each instance type defined in the ConfigMap is visible in the Anyscale UI through a drop-down list. Users can select these instance types when submitting workloads. Users may also define compute configs that use these instance types through the Anyscale CLI/SDK.

The Anyscale console refreshes roughly every 30 seconds with the latest instance types defined in the ConfigMap.

For accelerators, the accelerator_type value should map to the list of Ray-supported accelerators. If an accelerator type isn't defined in this list, open an issue on the Ray GitHub repository, and forward it to Anyscale support.

When the Anyscale operator applies a pod spec to Kubernetes for an Anyscale workload, the operator uses the shapes defined in the Instance Type ConfigMap as an upper bound on the sum of the memory requests and limits across all containers in the pod. Anyscale reserves some memory and CPU for critical-path Anyscale sidecar containers and gives the rest to the Ray container, which runs the primary workload.
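
The arithmetic behind this memory budget can be sketched as follows. The sidecar reservation of 1Gi is a hypothetical figure chosen for illustration, since this document doesn't state the real overhead, and the helper names are not part of any Anyscale API. The parser handles only the common Kubernetes quantity suffixes.

```python
# Binary and decimal suffixes used by Kubernetes resource quantities.
# Less common suffixes (for example "m", "P", "E") are not handled in this sketch.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '8Gi' or '512Mi' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain byte count


def ray_container_memory(shape_memory: str, sidecar_memory: str) -> int:
    """Return the bytes left for the Ray container after the sidecar reservation."""
    total = parse_memory(shape_memory)
    reserved = parse_memory(sidecar_memory)
    if reserved >= total:
        raise ValueError("sidecar reservation leaves no memory for the Ray container")
    return total - reserved


# "8Gi" is the 2CPU-8GB shape from the example ConfigMap.
# "1Gi" is a hypothetical sidecar reservation, not a documented value.
ray_bytes = ray_container_memory("8Gi", "1Gi")
print(f"Ray container memory: {ray_bytes / 2**30} GiB")
```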

Advanced: Patch ConfigMap

Kubernetes clusters vary in how they handle spot capacity, accelerators, and other cluster-specific concerns.

The Patch API provides an escape hatch for custom integrations. This API allows for just-in-time patching of all Anyscale-managed resources as they're applied to the Kubernetes cluster. The Patch API uses the JSON Patch syntax (IETF RFC 6902). As an example, consider the patch below:

patches:
  - kind: Pod
    # See: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
    selector: "anyscale.com/market-type in (ON_DEMAND)"
    # See: https://jsonpatch.com/
    patch:
      - op: add
        path: /spec/nodeSelector/eks.amazonaws.com~1capacityType # use ~1 to escape the forward-slash
        value: "ON_DEMAND"

The operator applies each patch to every Pod it creates that matches the patch's Kubernetes selector. In this case, the operator adds the eks.amazonaws.com/capacityType node selector to the Pod spec of every Pod labeled with the ON_DEMAND market type.
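
To show how the `~1` escape in the patch path resolves, here is a minimal sketch of a JSON Patch `add` operation on object members, following RFC 6902 and the JSON Pointer rules in RFC 6901. It is not the operator's implementation: the function names and the sample Pod are hypothetical, arrays are not handled, and, as in the RFC, the parent object (`/spec/nodeSelector` here) must already exist. Whether the operator creates a missing parent is not stated in this document.

```python
import copy


def parse_pointer(path: str) -> list:
    """Split an RFC 6901 JSON Pointer into unescaped reference tokens."""
    if not path.startswith("/"):
        raise ValueError(f"unsupported JSON Pointer: {path!r}")
    # Unescape ~1 before ~0, as RFC 6901 requires.
    return [t.replace("~1", "/").replace("~0", "~") for t in path[1:].split("/")]


def apply_add(document: dict, path: str, value) -> dict:
    """Apply a JSON Patch 'add' operation to a copy of document (object members only)."""
    tokens = parse_pointer(path)
    result = copy.deepcopy(document)
    parent = result
    for token in tokens[:-1]:
        parent = parent[token]  # raises KeyError if a parent object is missing
    parent[tokens[-1]] = value
    return result


# Hypothetical, heavily trimmed Pod manifest for illustration.
pod = {
    "metadata": {"labels": {"anyscale.com/market-type": "ON_DEMAND"}},
    "spec": {"nodeSelector": {}, "containers": []},
}

patched = apply_add(
    pod, "/spec/nodeSelector/eks.amazonaws.com~1capacityType", "ON_DEMAND"
)
print(patched["spec"]["nodeSelector"])
```

Without the `~1` escape, the slash in `eks.amazonaws.com/capacityType` would be read as a path separator and the patch would target a nested `capacityType` key instead of a single node selector entry.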

The Helm chart generates a variety of patches using the default configuration options that should work on EKS/GKE out-of-the-box without additional configuration. Additional patches to support custom autoscalers, ingresses, or other cluster-specific properties may be provided through the Helm chart.

View all labels that can be used for selection / patching

| Label name | Possible label values | Description |
| --- | --- | --- |
| anyscale.com/market-type | SPOT, ON_DEMAND | Users with workloads that support preemption may opt to run their workloads on spot node types through the compute config. All other workloads are run on on-demand node types. This should most likely be transformed into a node affinity. |
| anyscale.com/zone | user-defined through cloud setup | For Pods that have a specific zone affinity, the Anyscale operator sets this label to the zone that the Pod should be launched into (us-west-2a, for example). Zones are provided as []string at cloud registration time and can be selected from the Anyscale UI. This should most likely be transformed into a node affinity. |
| anyscale.com/accelerator-type | user-defined through instance type configuration | When requesting a GPU Pod, the Anyscale operator sets this label to one of the Anyscale accelerator types. |
| anyscale.com/instance-type | user-defined through instance type configuration | The operator sets this value for all Pods created through Anyscale. |
| anyscale.com/canary-weight, anyscale.com/canary-exists, anyscale.com/canary-svc, anyscale.com/ingress-type, anyscale.com/bearer-token, anyscale.com/primary-weight, anyscale.com/primary-svc | various | For advanced use only (when using an ingress other than NGINX for inference / serving workloads with Anyscale Services). Contact Anyscale for more details. |
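
When writing patch selectors against these labels, it can help to check which Pods a selector would match. The sketch below evaluates a single Kubernetes set-based requirement (`in` or `notin`) against a label dictionary. It is a simplified illustration, not the operator's matching code: it doesn't support equality-based selectors or multiple comma-separated requirements, and the function name and sample labels are hypothetical.

```python
import re

# Matches "<key> in (v1, v2)" or "<key> notin (v1, v2)".
_REQUIREMENT = re.compile(
    r"^\s*(?P<key>[^\s()]+)\s+(?P<op>in|notin)\s+\((?P<values>[^()]*)\)\s*$"
)


def matches(selector: str, labels: dict) -> bool:
    """Evaluate one set-based label requirement against a Pod's labels."""
    m = _REQUIREMENT.match(selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    values = {v.strip() for v in m.group("values").split(",") if v.strip()}
    actual = labels.get(m.group("key"))
    if m.group("op") == "in":
        return actual in values
    # notin also matches resources that don't carry the label at all.
    return actual not in values


# Hypothetical labels on a spot worker Pod.
spot_pod_labels = {
    "anyscale.com/market-type": "SPOT",
    "anyscale.com/zone": "us-west-2a",
}

print(matches("anyscale.com/market-type in (ON_DEMAND)", spot_pod_labels))
print(matches("anyscale.com/market-type in (SPOT, ON_DEMAND)", spot_pod_labels))
print(matches("anyscale.com/zone notin (us-west-2b)", spot_pod_labels))
```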

Uninstall the Anyscale operator

To uninstall the Anyscale operator, run the following commands:

helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>

To delete the cloud, run the following command:

anyscale cloud delete --name <cloud-name>