Anyscale operator for Kubernetes
The Anyscale operator for Kubernetes is in developer preview.
The Anyscale operator for Kubernetes enables deploying the Anyscale platform on Kubernetes clusters on Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), Oracle Container Engine for Kubernetes (OKE), Azure Kubernetes Service (AKS), CoreWeave, or other Kubernetes clusters running in the cloud or on-prem. See the diagram below for a high-level overview of the Anyscale operator:
View the resources that the Anyscale operator interacts with.
Namespaced resources
- Pods: each Anyscale / Ray node maps to a single pod.
- Services + Ingresses: used for head node connectivity (user laptop -> Ray dashboard) and for exposing Anyscale services (user laptop -> Anyscale Service). Ingresses may be either private or public.
- Secrets: used to store sensitive values that the Anyscale operator needs.
- ConfigMaps: used to store configuration options for the Anyscale operator.
- Events: used to enhance workload observability.
Global resources
- TokenReview: When a pod for an Anyscale workload starts, Anyscale uses the Kubernetes TokenReview API to verify the pod's identity as it bootstraps itself to the Anyscale control plane (see the sketch after this list).
- Nodes: The operator periodically reads node information to enhance workload observability.
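For illustration, a TokenReview exchange looks roughly like the following (a minimal sketch of the Kubernetes TokenReview API; the exact request the operator issues is internal to Anyscale):

apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  # Service account token presented by the bootstrapping pod.
  token: <pod-service-account-token>
status:
  # Populated by the API server in the response.
  authenticated: true
  user:
    username: system:serviceaccount:<namespace>:anyscale-operator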
Installing the Helm chart for the Anyscale operator requires permissions to create cluster roles and cluster role bindings, which grant the Anyscale operator the necessary permissions to manage the preceding global resources. If you don't have these permissions, consider deploying Anyscale inside a vCluster in a namespace of your choice, as sketched below.
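A minimal sketch of the vCluster approach using the vcluster CLI (the virtual cluster name is illustrative, and recent vcluster versions connect automatically after create):

vcluster create anyscale-vcluster --namespace <namespace>
# Point kubectl and helm at the virtual cluster before installing the chart.
vcluster connect anyscale-vcluster --namespace <namespace>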
Deployment modes
Cloud-native mode (only supported for AWS and GCP)
Cloud-native mode comes with first-class support for all Anyscale features, but requires setting up additional peripheral cloud resources (S3 buckets, IAM roles, etc.) before deploying the Anyscale operator. At this time, cloud-native mode is only supported on AWS and GCP. See the Terraform modules for a reference on these peripheral cloud resources required for cloud registration.
Cloud-agnostic mode (supported for any Kubernetes cluster)
Cloud-agnostic mode is more flexible and doesn't necessarily require setting up peripheral cloud resources. However, some Anyscale features, such as viewing logs through the Anyscale console, may be missing or unsupported unless the relevant cloud resources have been provided.
If running on EKS or GKE, use cloud-native mode when possible.
Prerequisites
- A Kubernetes cluster.
- Use Kubernetes v1.28 or later when possible. Earlier versions may work, but aren't fully tested.
- Permissions to deploy a Helm chart into the Kubernetes cluster.
- The name of the Kubernetes namespace where you want to deploy the Anyscale operator.
- An ingress controller. Use the Ingress-NGINX controller when possible. Other ingress controllers may work as well, but aren't fully tested.
- For direct networking, configure an internet-facing load balancer.
- For customer-defined networking, configure an internal load balancer.
- In some cases, you can apply an annotation to the LoadBalancer service in front of the NGINX pods to configure internal load balancing (see the sketch after this list).
- As a reference, see the Anyscale documentation on the difference between direct and customer-defined networking modes on the AWS VM stack, including the pros and cons of each approach.
- An IP or hostname that resolves to your ingress.
- For public clouds, this should be a public IP or hostname that resolves to a public IP.
- For private clouds, this should be a private IP or hostname that resolves to a private IP.
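As an example of the internal load balancing annotation mentioned above, the following sketch marks the LoadBalancer service in front of an Ingress-NGINX install as internal on EKS with the AWS Load Balancer Controller (the service name and annotation are assumptions that vary by cloud provider and controller version):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller  # name from a standard Ingress-NGINX install
  namespace: ingress-nginx
  annotations:
    # AWS Load Balancer Controller: provision an internal (VPC-only) load balancer.
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer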
Cloud-native, AWS
- An S3 bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- An IAM role for the Anyscale operator to use, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
Cloud-native, GCP
- A GCS bucket for system and artifact storage.
- All Pods created by Anyscale must have direct access to this storage bucket.
- The project ID of the Google Project that contains the target Kubernetes cluster.
- A service account for the Anyscale operator, for the purposes of verifying the operator identity.
See https://registry.terraform.io/modules/anyscale/anyscale-foundation-modules/kubernetes/latest for a reference on provisioning the core cloud resources required for cloud registration.
Cloud-agnostic
- A cloud storage bucket (optional, highly recommended). Supported storage buckets include Google Cloud Storage buckets, S3 buckets, or S3-compatible buckets (`s3://<bucket-name>` or `gs://<bucket-name>`).
  - Anyscale uses this cloud storage bucket for persisting various system artifacts in the customer account, including runtime environment uploads from `anyscale job` and `anyscale service` CLI commands.
  - If desired, an endpoint URL to override the default `AWS_ENDPOINT_URL`.
- An NFS mount target (optional, highly recommended).
- Anyscale uses NFS for Anyscale Workspaces persistence, as well as cluster shared storage.
- If desired, a path to pass into the NFS volume specification.
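For reference, these two values map onto a standard Kubernetes NFS volume specification roughly as follows (a sketch; the operator generates the actual volume spec, and the volume name here is illustrative):

volumes:
  - name: anyscale-shared-storage
    nfs:
      server: <nfs-mount-target>  # from --nfs-mount-target
      path: <nfs-mount-path>      # from --nfs-mount-path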
NOTE: Some Anyscale features, such as log viewing through the UI, aren't supported in cloud-agnostic mode at this time.
Deployment
Add the Anyscale Helm chart repository
helm repo add anyscale https://anyscale.github.io/helm-charts
helm repo update anyscale
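To confirm that the repository was added and the chart is visible, you can run:

helm search repo anyscale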
Then, sign in to your Anyscale account using `anyscale login`, and proceed with the following steps:
Cloud-native, AWS
anyscale cloud register --name <cloud-name> \
--provider aws \
--region <region> \
--compute-stack k8s \
--kubernetes-zones <comma-separated-zones> \
--anyscale-operator-iam-identity <anyscale-operator-iam-role-arn> \
--s3-bucket-id <s3-bucket-arn> \
--efs-id <efs-id>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=aws \
--set-string region=<region> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
Cloud-native, GCP
# Note: --project-id is only required if using NFS. --vpc-name is used to
# discover NFS mount targets from the Filestore instance specified below.
anyscale cloud register --name <cloud-name> \
--provider gcp \
--region <region> \
--compute-stack k8s \
--kubernetes-zones <comma-separated-zones> \
--anyscale-operator-iam-identity <anyscale-operator-service-account-email> \
--cloud-storage-bucket-name <cloud-storage-bucket-name> \
--project-id <project-id> \
--vpc-name <vpc-name> \
--filestore-instance-id <filestore-instance-id> \
--filestore-location <filestore-location>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=gcp \
--set-string region=<region> \
--set-string operatorIamIdentity=<anyscale-operator-service-account-email> \
--set-string workloadServiceAccountName=anyscale-operator \
--namespace <namespace> \
--create-namespace \
-i
On GKE, also grant the operator's Kubernetes service account permission to impersonate the Google service account through Workload Identity:
gcloud iam service-accounts add-iam-policy-binding <anyscale-operator-service-account-email> \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:<project-id>.svc.id.goog[<namespace>/anyscale-operator]"
Cloud-agnostic
anyscale cloud register --name <cloud-name> \
--provider generic \
--compute-stack k8s \
--cloud-storage-bucket-name <(s3:// or gcs://)> \
--cloud-storage-bucket-endpoint <(https://object.lga1.coreweave.com/, for example)> \
--nfs-mount-target <(passed to the "server" attr. of the NFS volume spec)> \
--nfs-mount-path <(passed to the "path" attr. of the NFS volume spec)>
# Acquire an ANYSCALE_CLI_TOKEN from the Anyscale console, and set it as an environment variable.
export ANYSCALE_CLI_TOKEN=<cli-token>
helm upgrade <release-name> anyscale/anyscale-operator \
--set-string cloudDeploymentId=<cloud-deployment-id> \
--set-string cloudProvider=generic \
--set-string anyscaleCliToken=$ANYSCALE_CLI_TOKEN \
--namespace <namespace> \
--create-namespace \
-i
At this point, the Anyscale operator should come up and start posting health checks to the Anyscale control plane. You should be ready to run workloads as you normally would on Anyscale clouds.
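For example, you can confirm that the operator pod is running with:

kubectl get pods -n <namespace>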
Submit a test job to verify the Anyscale operator installation:
anyscale job submit --cloud <cloud-name> --working-dir https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip -- python hello_world.py
Configuration options
End-users of Anyscale features (data scientists, ML engineers, etc.) submit workloads to Anyscale by defining compute configs, which give them control over the instance types and shapes that their applications require. As an example, consider the following compute config for a Ray workload that requires some CPU workers and some A10G GPU workers on AWS:
cloud: aws-cloud
zones:
  - us-west-2a
  - us-west-2b
head_node:
  instance_type: m5.8xlarge
worker_nodes:
  - instance_type: m5.8xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: PREFER_SPOT
  - instance_type: g5.4xlarge
    min_nodes: 0
    max_nodes: 5
    market_type: ON_DEMAND
The Anyscale operator supports all of these features (zone selection, instance type selection, market type selection), but requires customization to integrate with cluster-specific properties. Many of these properties are set through Helm chart options.
Instance Type ConfigMap
When running on top of Kubernetes, an Anyscale "instance type" maps to a Pod shape. The cloud administrator defines instance types when setting up the Anyscale operator, either through the Helm chart options or out-of-band by editing the `instance-types` ConfigMap that the Helm chart creates.
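If you define instance types through the Helm chart options, the values may look like the following sketch (the `instanceTypes` key is an assumption; check the chart's values.yaml for the exact key and schema):

instanceTypes:
  2CPU-8GB:
    resources:
      CPU: 2
      memory: 8Gi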
Here's an example of what the generated ConfigMap may look like:
(base) [~]$ kubectl get configmap instance-types -o yaml
apiVersion: v1
data:
  instance_types.yaml: |-
    # A small CPU-only shape.
    2CPU-8GB:
      resources:
        CPU: 2
        memory: 8Gi
    # A larger shape with both CPU and GPU.
    8CPU-32GB-1xT4:
      resources:
        CPU: 8
        GPU: 1
        accelerator_type:T4: 1
        memory: 32Gi
  version: v1
`2CPU-8GB` and `8CPU-32GB-1xT4` are names that follow an Anyscale naming convention. Cloud administrators may use a naming convention of their choice; valid characters include alphanumeric characters, dashes, and underscores.
Each instance type defined in the ConfigMap is visible in the Anyscale UI through a drop-down list. Users can select these instance types when submitting workloads. Users may also define compute configs that use these instance types through the Anyscale CLI/SDK.
The Anyscale console is updated roughly every 30 seconds with the latest instance types defined in the ConfigMap.
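To update instance types out-of-band after installation, you can edit the ConfigMap directly, for example:

kubectl edit configmap instance-types -n <namespace>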
For accelerators, the `accelerator_type` value should map to the list of Ray-supported accelerators. If an accelerator type isn't defined in this list, open an issue on the Ray GitHub repository, and forward it to Anyscale support.
When the Anyscale operator applies a pod spec to Kubernetes for an Anyscale workload, the operator uses the shapes defined in the Instance Type ConfigMap as an upper bound for the sum of all memory requests and limits across all containers in the pod. Anyscale reserves some memory and CPU for critical-path Anyscale sidecar containers, and provides the rest to the Ray container to run the primary workload.
Advanced: Patch ConfigMap
Different Kubernetes clusters vary in how they handle spot capacity, accelerators, and other cluster-specific concerns.
The Patch API provides an escape hatch to handle custom integrations. This API allows for just-in-time patching of all Anyscale-managed resources as they're applied to the Kubernetes cluster. The Patch API uses JSON Patch syntax (IETF RFC 6902). As an example, consider the patch below:
patches:
  - kind: Pod
    # See: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
    selector: "anyscale.com/market-type in (ON_DEMAND)"
    # See: https://jsonpatch.com/
    patch:
      - op: add
        path: /spec/nodeSelector/eks.amazonaws.com~1capacityType # use ~1 to escape the forward-slash
        value: "ON_DEMAND"
The Anyscale operator applies this set of patches to every Pod it creates that matches the Kubernetes selector. In this case, the operator adds the `eks.amazonaws.com/capacityType` node selector to the Pod spec.
The Helm chart generates a variety of patches using the default configuration options that should work on EKS or GKE out-of-the-box without additional configuration. The Helm chart also accepts additional patches to support custom autoscalers, ingresses, or other cluster-specific properties.
View example patches
These patches may require slight modifications to work with your Kubernetes cluster setup, because versions of downstream resources may have changed since the time we wrote these patches. Use them as a starting point for using different types of downstream resources.
Using the AWS Load Balancer Controller
First, create a temporary minimal ingress with a fixed group name, such as:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    alb.ingress.kubernetes.io/group.name: anyscale
spec:
  rules:
    - http:
        paths:
          - path: /testpath
            pathType: Prefix
            backend:
              service:
                name: test
                port:
                  number: 80
Describe the ingress (`kubectl describe ingress minimal-ingress`) to retrieve the `status.loadBalancer.ingress.hostname` attribute. It should resemble `anyscale-1421684687.us-west-2.elb.amazonaws.com`, an address that stays consistent for all ingresses you create with this group name.
Use this value for the `--kubernetes-ingress-external-address` flag during cloud registration.
After you retrieve the address, delete the temporary ingress.
Then, apply this set of additional patches.
additionalPatches:
  # Apply these patches to all `Ingress` resources created by Anyscale.
  - kind: Ingress
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1group.name
        value: "anyscale"
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1load-balancer-name
        value: "anyscale"
      # Uncomment this if you want to use an internal ALB, which is only accessible
      # from inside the VPC and requires a VPN to access from a laptop.
      # - op: add
      #   path: /metadata/annotations/alb.ingress.kubernetes.io~1scheme
      #   value: "internal"
  # NOTE: The rest of the patches are only required for Anyscale services functionality.
  # They aren't required for basic head node connectivity for workspaces and jobs.
  # When the Anyscale operator performs a service rollout, two ingress resources are
  # created for the NGINX Ingress Controller. The first ingress is the primary ingress,
  # and the second ingress is the canary ingress.
  # The Anyscale operator uses the following patches to convert from the NGINX Ingress Controller
  # scheme to the ALB Ingress Controller scheme, handling the case where a single ingress and service
  # exist and the case where a rollout is in progress and two ingresses and services exist.
  # The ALB Ingress Controller doesn't need two ingresses to manage canary deployments.
  # Instead, the ALB Ingress Controller can manage canary deployments through a single
  # ingress resource. This patch modifies the ingress resource created by the Anyscale
  # operator to use the ALB Ingress Controller scheme, by updating the annotations on
  # the primary ingress when a canary exists.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (true)"
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
        value: >
          {
            "type": "forward",
            "forwardConfig": {
              "targetGroups": [
                {
                  "serviceName": "{{.PrimarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.PrimaryWeight}}
                },
                {
                  "serviceName": "{{.CanarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.CanaryWeight}}
                }
              ],
              "targetGroupStickinessConfig": {
                "enabled": false
              }
            }
          }
      # Update the serviceName and servicePort to point to the action name
      # so that the rules in the annotation are used. For more information,
      # see:
      # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/name
        value: anyscale
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/port/name
        value: use-annotation
  # This patch handles the primary ingress when a canary doesn't exist (for
  # example, when a service rollout isn't in progress) by adding a set of actions
  # to forward traffic to the primary service.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (primary), anyscale.com/canary-exists in (false)"
    patch:
      - op: add
        path: /metadata/annotations/alb.ingress.kubernetes.io~1actions.anyscale
        value: >
          {
            "type": "forward",
            "forwardConfig": {
              "targetGroups": [
                {
                  "serviceName": "{{.PrimarySvc}}",
                  "servicePort": "8000",
                  "weight": {{.PrimaryWeight}}
                }
              ],
              "targetGroupStickinessConfig": {
                "enabled": false
              }
            }
          }
      # Update the serviceName and servicePort to point to the action name
      # so that the rules in the annotation are used. For more information,
      # see:
      # https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/annotations/#actions
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/name
        value: anyscale
      - op: replace
        path: /spec/rules/0/http/paths/0/backend/service/port/name
        value: use-annotation
  # This patch handles the canary ingress by rewriting it into a no-op ingress. The ALB Ingress
  # Controller doesn't need a separate ingress for canary deployments, so this patch no-ops it.
  - kind: Ingress
    selector: "anyscale.com/ingress-type in (canary)"
    patch:
      - op: replace
        path: /spec
        value:
          defaultBackend:
            service:
              name: default-backend
              port:
                number: 80
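One way to apply these patches is to save them in a values file and pass it to Helm while leaving your other settings untouched (assuming the chart reads them from the `additionalPatches` key shown above):

helm upgrade <release-name> anyscale/anyscale-operator \
--values alb-patches.yaml \
--reuse-values \
--namespace <namespace>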
View all annotations provided by Anyscale that you can use for custom patches
These annotations are applied by the Anyscale control plane on resources created by Anyscale.
| Label name | Possible label values | Description |
| --- | --- | --- |
| `anyscale.com/market-type` | `SPOT`, `ON_DEMAND` | Users with workloads that support preemption may opt to run their workloads on spot node types through the compute config. All other workloads run on on-demand node types. This label should most likely be transformed into a node affinity (see the sketch after this table). |
| `anyscale.com/zone` | user-defined through cloud setup | For Pods that have a specific zone affinity, the Anyscale operator sets this label to the zone that the Pod should launch into (`us-west-2a`, for example). Zones are provided as `[]string` at cloud registration time and can be selected from the Anyscale UI. This label should most likely be transformed into a node affinity. |
| `anyscale.com/accelerator-type` | user-defined through instance type configuration | When requesting a GPU Pod, the Anyscale operator sets this label to one of the Anyscale accelerator types. |
| `anyscale.com/instance-type` | user-defined through instance type configuration | The operator sets this value for all Pods created through Anyscale. |
| `anyscale.com/canary-weight`, `anyscale.com/canary-exists`, `anyscale.com/canary-svc`, `anyscale.com/ingress-type`, `anyscale.com/bearer-token`, `anyscale.com/primary-weight`, `anyscale.com/primary-svc` | various | For advanced use only (when using an ingress other than NGINX for inference/serving workloads with Anyscale Services). Contact Anyscale for more details. |
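As an example of transforming the market-type label into a scheduling constraint, on GKE you might steer SPOT workloads onto spot node pools with a patch like the following (a sketch; `cloud.google.com/gke-spot` is the standard GKE spot node label, but verify that it matches your node pools):

additionalPatches:
  - kind: Pod
    selector: "anyscale.com/market-type in (SPOT)"
    patch:
      - op: add
        path: /spec/nodeSelector/cloud.google.com~1gke-spot
        value: "true"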
Advanced: Compute configuration options
The compute configuration allows you to provide workload-scoped advanced configuration settings for either a specific node or the entire cluster. Anyscale applies this configuration as a strategic merge patch to the Pod specifications generated by Anyscale before sending them to the Kubernetes API.
Node-specific configurations override any cluster-wide configurations for that node type. For a full reference on configuring these properties, see the official Kubernetes documentation.
View a sample of common advanced configuration options.
{
  "metadata": {
    // Add a new label.
    "labels": {"new-label": "example-value"},
    // Add a new annotation.
    "annotations": {"new-annotation": "example-value"}
  },
  "spec": {
    // Add a node selector.
    "nodeSelector": {"disktype": "ssd"},
    "tolerations": [{
      "effect": "NoSchedule",
      "key": "dedicated",
      "value": "example-anyscale"
    }],
    "containers": [{
      // Add a PersistentVolumeClaim to the Ray container.
      "name": "ray",
      "volumeMounts": [{
        "name": "pvc-volume",
        "mountPath": "/mnt/pvc-data"
      }]
    },{
      // Add a sidecar for exporting logs/metrics.
      "name": "monitoring-sidecar",
      "image": "timberio/vector:latest",
      "ports": [{
        "containerPort": 9000
      }],
      "volumeMounts": [{
        "name": "vector-volume",
        "mountPath": "/mnt/vector-data"
      }]
    }],
    "volumes": [{
      "name": "pvc-volume",
      "persistentVolumeClaim": {
        "claimName": "my-pvc"
      }
    },{
      "name": "vector-volume",
      "emptyDir": {}
    }]
  }
}
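To attach such a configuration to a workload, place it in the compute config. A sketch, assuming the field is named `advanced_instance_config` as on other Anyscale stacks (check the compute config reference for your Anyscale version):

cloud: my-k8s-cloud
head_node:
  instance_type: 2CPU-8GB
  advanced_instance_config:  # assumed field name
    metadata:
      labels:
        new-label: example-value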
Uninstall the Anyscale operator
View uninstallation instructions
To uninstall the Anyscale operator, run the following commands:
helm uninstall <release-name> -n <namespace>
kubectl delete namespace <namespace>
To delete the cloud, run the following command:
anyscale cloud delete --name <cloud-name>