Anyscale operator for Kubernetes

info

The Anyscale operator for Kubernetes is in developer preview.

The Anyscale operator for Kubernetes enables deploying the Anyscale platform on Kubernetes clusters on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Container Engine for Kubernetes (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters running in the cloud or on-premises. See the diagram below for a high-level overview of the Anyscale operator:

View the resources that the Anyscale operator interacts with.

Namespaced resources

  • Pods: each Anyscale / Ray node maps to a single pod.
  • Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale services (user laptop -> Anyscale Service). Ingresses may be either private or public.
  • Secrets: store sensitive values that the Anyscale operator needs at runtime.
  • ConfigMaps: used to store configuration options for the Anyscale operator.
  • Events: used to enhance workload observability.

Global resources

  • TokenReview: on startup of an Anyscale node in an Anyscale workload, Anyscale uses the Kubernetes TokenReview API to verify the pod's identity when the pod bootstraps itself to the Anyscale control plane (see the sketch after this list).
  • Nodes: The operator periodically reads node information to enhance workload observability.
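
For reference, a TokenReview request has the following shape. This is a minimal sketch of the standard Kubernetes API that the operator calls; the token value comes from the pod's projected service account token:

apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  # The bearer token presented by the bootstrapping pod, typically projected at
  # /var/run/secrets/kubernetes.io/serviceaccount/token.
  token: <service-account-token>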

Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the necessary permissions to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside vCluster in a namespace of your choice.
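
To check whether you have these permissions before installing, you can query the Kubernetes API server directly:

# Both commands should print "yes" for the installing user.
kubectl auth can-i create clusterroles
kubectl auth can-i create clusterrolebindings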

Deployment modes

Cloud-native mode (only supported for AWS and GCP)

Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.

Cloud-agnostic mode (supported for any Kubernetes cluster)

Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.

tip

If running on EKS or GKE, use cloud-native mode when possible.

Prerequisites

  • A Kubernetes cluster.
    • Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
  • Permissions to deploy a Helm chart into the Kubernetes cluster.
  • The name of the Kubernetes namespace to deploy the Anyscale operator into.
  • An ingress controller. Use the Ingress-NGINX controller when possible. Other ingress controllers may work as well, but aren't fully tested.
    • For direct networking, configure an internet-facing load balancer.
    • For customer-defined networking, configure an internal load balancer.
      • In some cases, you can apply an annotation to the LoadBalancer service in front of the NGINX pods to configure internal load balancing (see the example after this list).
    • As a reference, see the Anyscale documentation on the difference between direct and customer-defined networking modes on the AWS VM stack, including the pros and cons of each approach.
  • An IP or hostname that resolves to your ingress.
    • For public clouds, this should be a public IP or hostname that resolves to a public IP.
    • For private clouds, this should be a private IP or hostname that resolves to a private IP.
  • An S3 bucket for system and artifact storage.
    • All Pods created by Anyscale must have direct access to this storage bucket.
  • An IAM role for the Anyscale operator to use when verifying its identity.
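
As an example of the internal load balancing annotation mentioned in the list above, on EKS the ingress-nginx Helm chart accepts service annotations through its values file. This is a sketch only; the exact annotation depends on your load balancer controller and its version:

controller:
  service:
    annotations:
      # Request an internal (not internet-facing) load balancer from AWS.
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal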

See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.

Deployment

Add the Anyscale Helm chart repository

helm repo add anyscale https://anyscale.github.io/helm-charts
helm repo update anyscale

Then, sign in to your Anyscale account using anyscale login, and proceed with the following steps:

anyscale cloud register --name <cloud-name> \
  --provider aws \
  --region <region> \
  --compute-stack k8s \
  --kubernetes-zones <comma-separated-zones> \
  --anyscale-operator-iam-identity <anyscale-operator-iam-role-arn> \
  --s3-bucket-id <s3-bucket-arn> \
  --efs-id <efs-id>

helm upgrade <release-name> anyscale/anyscale-operator \
  --set-string cloudDeploymentId=<cloud-deployment-id> \
  --set-string cloudProvider=aws \
  --set-string region=<region> \
  --set-string workloadServiceAccountName=anyscale-operator \
  --namespace <namespace> \
  --create-namespace \
  -i

At this point, the Anyscale operator should come up and start posting health checks to the Anyscale control plane. You should be ready to run workloads as you normally would on Anyscale clouds.
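
To confirm that the operator is running, inspect its pod in the target namespace. The exact pod and deployment names depend on your Helm release name:

# The operator pod should reach the Running state.
kubectl get pods -n <namespace>
# Check the operator logs for successful health checks (the deployment name may vary).
kubectl logs -n <namespace> deployment/<release-name>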

Try to submit a job to verify the Anyscale operator installation:

anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py

Configuration options

End users of Anyscale features (data scientists, ML engineers, etc.) submit workloads to Anyscale by defining compute configs, which give them control over the instance types and shapes that their apps require. As an example, consider the following compute config for a Ray workload that requires some CPU workers and some A10G GPU workers on AWS:

cloud: aws-cloud
zones:
- us-west-2a
- us-west-2b
head_node:
  instance_type: m5.8xlarge
worker_nodes:
- instance_type: m5.8xlarge
  min_nodes: 0
  max_nodes: 5
  market_type: PREFER_SPOT
- instance_type: g5.4xlarge
  min_nodes: 0
  max_nodes: 5
  market_type: ON_DEMAND

The Anyscale operator supports all of these features (zone selection, instance type selection, market type selection), but requires customization to integrate with cluster-specific properties. Many of these properties are set through Helm chart options.

Instance Type ConfigMap

When running on top of Kubernetes, an Anyscale "instance type" maps to a Pod shape. The cloud administrator defines instance types when setting up the Anyscale operator through either the Helm chart options or out-of-band by editing the instance-types ConfigMap that the Helm chart creates.

Here's an example of what the generated ConfigMap may look like:

$ kubectl get configmap instance-types -o yaml
apiVersion: v1
data:
  instance_types.yaml: |-
    # A small CPU-only shape.
    2CPU-8GB:
      resources:
        CPU: 2
        memory: 8Gi
    # A larger shape with both CPU and GPU.
    8CPU-32GB-1xT4:
      resources:
        CPU: 8
        GPU: 1
        accelerator_type:T4: 1
        memory: 32Gi
    version: v1

2CPU-8GB and 8CPU-32GB-1xT4 are names that follow an Anyscale naming convention. Cloud administrators may use a naming convention of their choice: valid characters include alphanumeric characters, dashes, and underscores.
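
For example, to add a new shape out-of-band, edit the ConfigMap and append an entry that follows the same schema as above (the 4CPU-16GB name and shape here are illustrative):

kubectl edit configmap instance-types -n <namespace>

Then add the new shape under instance_types.yaml:

4CPU-16GB:
  resources:
    CPU: 4
    memory: 16Gi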

Each instance type defined in the ConfigMap is visible in the Anyscale UI through a drop-down list. Users can select these instance types when submitting workloads. Users may also define compute configs that use these instance types through the Anyscale CLI/SDK.

The Anyscale console picks up the latest instance types defined in the ConfigMap approximately every 30 seconds.

For accelerators, the accelerator_type value should map to the list of Ray-supported accelerators. If an accelerator type isn't defined in this list, open an issue on the Ray GitHub repository, and forward it to Anyscale support.

When the Anyscale operator applies a pod spec to Kubernetes for an Anyscale workload, the operator uses the shapes defined in the Instance Type ConfigMap as an upper bound for the sum of the memory requests and limits across all containers in the pod. Anyscale reserves some memory and CPU for critical-path Anyscale sidecar containers, and provides the rest to the Ray container to run the primary workload. For example, with the 2CPU-8GB shape above, the memory requests and limits of the Ray container and the Anyscale sidecars together total at most 8Gi.

Advanced: Patch ConfigMap

Different Kubernetes clusters vary in how they handle spot capacity, accelerators, and other cluster-specific concerns.

The Patch API provides an escape hatch to handle custom integrations. This API allows for just-in-time patching of all Anyscale-managed resources as they're applied to the Kubernetes cluster. The Patch API uses JSON Patch syntax (IETF RFC 6902). As an example, consider the patch below:

patches:
- kind: Pod
  # See: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
  selector: "anyscale.com/market-type in (ON_DEMAND)"
  # See: https://jsonpatch.com/
  patch:
  - op: add
    path: /spec/nodeSelector/eks.amazonaws.com~1capacityType # use ~1 to escape the forward-slash
    value: "ON_DEMAND"

For every Pod that the Anyscale operator creates, the operator applies each patch whose selector matches the Pod. In this case, the operator adds the eks.amazonaws.com/capacityType node selector to the spec of every on-demand Pod.

The Helm chart generates a variety of patches using the default configuration options that should work on EKS or GKE out-of-the-box without additional configuration. The Helm chart also accepts additional patches to support custom autoscalers, ingresses, or other cluster-specific properties.
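
For illustration, the spot counterpart of the preceding on-demand patch could be supplied through the Helm chart's additionalPatches value. This sketch assumes the EKS eks.amazonaws.com/capacityType node label; the chart's default patches may already cover this case on EKS:

additionalPatches:
- kind: Pod
  selector: "anyscale.com/market-type in (SPOT)"
  patch:
  - op: add
    path: /spec/nodeSelector/eks.amazonaws.com~1capacityType
    value: "SPOT"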

View example patches

These patches may require slight modifications to work with your Kubernetes cluster setup, because versions of downstream resources may have changed since these patches were written. Use them as a starting point for integrating with different types of downstream resources.

Using the AWS Load Balancer Controller

First, create a temporary minimal ingress with a fixed group name, such as:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    alb.ingress.kubernetes.io/group.name: anyscale
spec:
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80

Describe the ingress (kubectl describe ingress minimal-ingress) to retrieve the status.loadBalancer.ingress.hostname attribute. It should resemble anyscale-1421684687.us-west-2.elb.amazonaws.com, an address that stays consistent for all ingresses you create with this group name.
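
You can also retrieve the hostname directly with a JSONPath query:

kubectl get ingress minimal-ingress \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'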

Use this value for the --kubernetes-ingress-external-address flag during cloud registration.

After you retrieve the address, delete the temporary ingress.

Then, apply this set of additional patches.

additionalPatches:
# Apply these patches to all `Ingress` resources created by Anyscale.
- kind: Ingress
  patch:
  - op: add
    path: /metadata/annotations/alb.ingress.kubernetes.io~1group.name
    value: "anyscale"
  - op: add
    path: /metadata/annotations/alb.ingress.kubernetes.io~1load-balancer-name
    value: "anyscale"
  # Uncomment this if you want to use an internal ALB, which is only accessible
  # from inside the VPC and requires a VPN to access from a laptop.
  # - op: add
  #   path: /metadata/annotations/alb.ingress.kubernetes.io~1scheme
  #   value: "internal"

# NOTE: The rest of the patches are only required for Anyscale services functionality.
# They aren't required for basic head node connectivity for workspaces and jobs.

# When the Anyscale operator performs a service rollout, it creates two ingress
# resources for the NGINX Ingress Controller. The first ingress is the primary
# ingress, and the second ingress is the canary ingress.

# The Anyscale operator uses the following patches to convert from the NGINX Ingress
# Controller scheme to the ALB Ingress Controller scheme, handling the case where a
# single ingress and service exist and the case where a rollout is in progress and
# two ingresses and services exist.

# The ALB Ingress Controller doesn't need two ingresses to manage canary deployments.
# Instead, the ALB Ingress Controller can manage canary deployments through a single
# ingress resource. This patch modifies the ingress resource created by the Anyscale
# operator to use the ALB Ingress Controller scheme, by updating the annotations on
# the primary ingress when a canary exists.
- kind: Ingress
  selector: anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (true)
  patch:
  - op: add
    path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
    value: >
      {
        "type": "forward",
        "forwardConfig": {
          "targetGroups": [
            {
              "serviceName": "{{.PrimarySvc}}",
              "servicePort": "8000",
              "weight": {{.PrimaryWeight}}
            },
            {
              "serviceName": "{{.CanarySvc}}",
              "servicePort": "8000",
              "weight": {{.CanaryWeight}}
            }
          ],
          "targetGroupStickinessConfig": {
            "enabled": false
          }
        }
      }
  # Update the serviceName and servicePort to point to the action name
  # so that the rules in the annotation are used. For more information, see:
  # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
  - op: replace
    path: /spec/rules/0/http/paths/0/backend/service/name
    value: anyscale
  - op: replace
    path: /spec/rules/0/http/paths/0/backend/service/port/name
    value: use-annotation

# This patch handles the primary ingress when a canary doesn't exist (for
# example, when a service rollout isn't in progress) by adding a set of actions
# to forward traffic to the primary service.
- kind: Ingress
  selector: anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (false)
  patch:
  - op: add
    path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
    value: >
      {
        "type": "forward",
        "forwardConfig": {
          "targetGroups": [
            {
              "serviceName": "{{.PrimarySvc}}",
              "servicePort": "8000",
              "weight": {{.PrimaryWeight}}
            }
          ],
          "targetGroupStickinessConfig": {
            "enabled": false
          }
        }
      }
  # Update the serviceName and servicePort to point to the action name
  # so that the rules in the annotation are used. For more information, see:
  # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
  - op: replace
    path: /spec/rules/0/http/paths/0/backend/service/name
    value: anyscale
  - op: replace
    path: /spec/rules/0/http/paths/0/backend/service/port/name
    value: use-annotation

# This patch handles the canary ingress by rewriting it into a no-op ingress. The ALB Ingress
# Controller doesn't need a separate ingress for canary deployments, so this patch no-ops it.
- kind: Ingress
  selector: "anyscale.com/ingress-type in (canary)"
  patch:
  - op: replace
    path: /spec
    value:
      defaultBackend:
        service:
          name: default-backend
          port:
            number: 80

View all labels provided by Anyscale that you can use for custom patches

The Anyscale control plane applies these labels to resources created by Anyscale.

| Label name | Possible label values | Description |
| --- | --- | --- |
| anyscale.com/market-type | SPOT, ON_DEMAND | Users with workloads that support preemption may opt to run their workloads on spot node types through the compute config. All other workloads run on on-demand node types. This label should most likely be transformed into a node affinity. |
| anyscale.com/zone | User-defined through cloud setup | For Pods that have a specific zone affinity, the Anyscale operator sets this label to the zone that the Pod should launch into (us-west-2a, for example). Zones are provided as []string at cloud registration time and can be selected from the Anyscale UI. This label should most likely be transformed into a node affinity. |
| anyscale.com/accelerator-type | User-defined through instance type configuration | When requesting a GPU Pod, the Anyscale operator sets this label to one of the Anyscale accelerator types. |
| anyscale.com/instance-type | User-defined through instance type configuration | The operator sets this value for all Pods created through Anyscale. |
| anyscale.com/canary-weight, anyscale.com/canary-exists, anyscale.com/canary-svc, anyscale.com/ingress-type, anyscale.com/bearer-token, anyscale.com/primary-weight, anyscale.com/primary-svc | Various | For advanced use only (when using an ingress other than NGINX for inference or serving workloads with Anyscale Services). Contact Anyscale for more details. |
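
As an example of using these labels, a cluster that schedules zones through the well-known topology.kubernetes.io/zone node label could translate the anyscale.com/zone label into a node selector with a patch like the following. This is a sketch with one patch per registered zone; a node affinity could be expressed the same way:

additionalPatches:
- kind: Pod
  selector: "anyscale.com/zone in (us-west-2a)"
  patch:
  - op: add
    path: /spec/nodeSelector/topology.kubernetes.io~1zone
    value: "us-west-2a"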

Advanced: Compute configuration options

The compute configuration allows you to provide workload-scoped advanced configuration settings for either a specific node or the entire cluster. Anyscale applies this configuration as a strategic merge patch to the Pod specifications generated by Anyscale before sending them to the Kubernetes API.

Node-specific configurations override any cluster-wide configurations for that node type. For a full reference on configuring these properties, see the official Kubernetes documentation.

View a sample of common advanced configuration options.

{
  "metadata": {
    // Add a new label.
    "labels": {"new-label": "example-value"},
    // Add a new annotation.
    "annotations": {"new-annotation": "example-value"}
  },
  "spec": {
    // Add a node selector.
    "nodeSelector": {"disktype": "ssd"},
    // Add a toleration for a dedicated node taint.
    "tolerations": [{
      "effect": "NoSchedule",
      "key": "dedicated",
      "value": "example-anyscale"
    }],
    "containers": [{
      // Add a PersistentVolumeClaim to the Ray container.
      "name": "ray",
      "volumeMounts": [{
        "name": "pvc-volume",
        "mountPath": "/mnt/pvc-data"
      }]
    },{
      // Add a sidecar for exporting logs/metrics.
      "name": "monitoring-sidecar",
      "image": "timberio/vector:latest",
      "ports": [{
        "containerPort": 9000
      }],
      "volumeMounts": [{
        "name": "vector-volume",
        "mountPath": "/mnt/vector-data"
      }]
    }],
    "volumes": [{
      "name": "pvc-volume",
      "persistentVolumeClaim": {
        "claimName": "my-pvc"
      }
    },{
      "name": "vector-volume",
      "emptyDir": {}
    }]
  }
}

Uninstall the Anyscale operator

View uninstallation instructions

To uninstall the Anyscale operator, run the following command:

helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>

To delete the cloud, run the following command:

anyscale cloud delete --name <cloud-name>