Configure machine pools for Kubernetes
This page describes how to configure an Anyscale machine pool for an Anyscale cloud deployed on Amazon EKS or Google GKE. The global resource scheduler manages machines across reserved node groups on your Kubernetes cluster using taints, tolerations, and patches applied by the Anyscale operator.
See Share compute resources with Anyscale machine pools and What is the global resource scheduler?.
Machine pools and the global resource scheduler are in beta release.
Anyscale supports machine pools on Kubernetes only for Anyscale clouds deployed on Amazon EKS or Google GKE. Anyscale doesn't support machine pools on AKS or other Kubernetes distributions.
How do machine pools work on Kubernetes?
On Kubernetes, a machine pool corresponds to one or more node groups that you dedicate to Anyscale workloads managed by the global resource scheduler. You taint those node groups with the machine pool ID so that only pods tolerating the taint can schedule on them.
The Anyscale operator applies an annotation with the machine pool ID to every pod that the global resource scheduler launches. You add a patch to the Anyscale operator that matches this annotation and injects the matching toleration. Anyscale workloads deployed through the global resource scheduler launch pods using the reserved node groups. Pods that request the same instance types outside the global resource scheduler don't receive the toleration and schedule on other available node groups.
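Conceptually, a pod launched through the global resource scheduler ends up with a spec like the following sketch. The pod name is hypothetical, but the annotation key and toleration mirror the operator configuration described later on this page:

# Illustrative sketch of a scheduler-launched pod. The name is a placeholder;
# the annotation and toleration match the machine pool taint configuration.
apiVersion: v1
kind: Pod
metadata:
  name: example-ray-worker    # hypothetical name
  annotations:
    anyscale.com/anyscale-machine-pool-id: mp_123xyz
spec:
  tolerations:
    - key: anyscale-machine-pool-id
      operator: Equal
      value: mp_123xyz
      effect: NoSchedule
  # workload containers omitted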
The steps on this page show one approach to configuring taints and tolerations with operator patches. Your Kubernetes environment might require a different combination of taints, labels, and node selectors.
Requirements
Ensure you have the following before configuring a machine pool on Kubernetes:
- An Anyscale cloud deployed on Amazon EKS or Google GKE with the Anyscale operator installed. See Anyscale on Kubernetes.
- Permissions to taint node groups and modify the Anyscale operator Helm values for your cluster.
- The following roles in Anyscale:
- Organization owner for your Anyscale organization.
- Cloud collaborator or cloud owner for your Anyscale cloud.
- The Anyscale CLI, authenticated to your Anyscale organization. See CLI configuration.
Step 1: Create the machine pool
Run the following command to create an empty machine pool and capture the machine pool ID from the output:
anyscale machine-pool create --name reserved-nodes
The machine pool ID has the format mp_<random-id>. Record this value. You use this ID when tainting node groups and configuring the operator.
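For example, you can capture the ID in a shell variable for use in the later steps and sanity-check its format. The ID below is a placeholder; substitute the value printed by the create command:

```shell
# Placeholder ID; replace with the value printed by `anyscale machine-pool create`.
MACHINE_POOL_ID="mp_123xyz"

# Sanity-check the expected mp_<random-id> format before using the ID
# in taints and operator patches.
if printf '%s' "$MACHINE_POOL_ID" | grep -Eq '^mp_[A-Za-z0-9]+$'; then
  echo "machine pool ID looks valid: $MACHINE_POOL_ID"
fi
```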
Step 2: Attach the machine pool to your cloud
Attach the machine pool to the Anyscale cloud deployed on Kubernetes:
anyscale machine-pool attach --name reserved-nodes --cloud <cloud-name>
Step 3: Taint your reserved node groups
In your Kubernetes cluster, apply a taint to each node group you want to reserve for the global resource scheduler. Use anyscale-machine-pool-id as the taint key and the machine pool ID as the value:
key: anyscale-machine-pool-id
value: mp_123xyz
effect: NoSchedule
The exact mechanism depends on your cloud provider and node group tooling. Apply the taint through your managed node group configuration, Karpenter NodePool, Terraform module, or equivalent.
Anyscale recommends also setting anyscale-machine-pool-id: mp_123xyz as a label on the tainted node groups. This makes the node groups selectable by other tools in your cluster.
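As one concrete example, with eksctl-managed node groups on EKS you could declare the taint and label together in the cluster config file. The cluster name, region, node group name, and instance type below are placeholders for your environment:

# Illustrative eksctl fragment; names, region, and instance type are
# placeholders. The taint and label carry the machine pool ID.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
managedNodeGroups:
  - name: reserved-nodes
    instanceType: m5.2xlarge
    labels:
      anyscale-machine-pool-id: mp_123xyz
    taints:
      - key: anyscale-machine-pool-id
        value: mp_123xyz
        effect: NoSchedule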
Step 4: Define custom instance types (optional)
If your reserved nodes use a shape that isn't in the Anyscale operator's default instance types, define a matching instance type in your Helm values file so that compute configs can request it. See Configure custom instance types.
The following example defines a 7CPU-16GB instance type:
workloads:
  instanceTypes:
    additional:
      7CPU-16GB:
        resources:
          CPU: 7
          memory: 16Gi
Step 5: Add a toleration patch for the machine pool
Add a patch to your Helm values file that injects the toleration for your machine pool taint. The operator applies this patch to every pod it launches through the global resource scheduler because those pods carry the anyscale.com/anyscale-machine-pool-id annotation. See Apply custom patches.
patches:
  - kind: Pod
    selector: "anyscale.com/anyscale-machine-pool-id in (mp_123xyz)"
    patch:
      - op: add
        path: /spec/tolerations/-
        value:
          key: "anyscale-machine-pool-id"
          operator: "Equal"
          value: "mp_123xyz"
          effect: "NoSchedule"
Apply the updated Helm values with helm upgrade. See Apply custom configurations for your Anyscale operator.
Pods that request the same instance types outside the global resource scheduler don't carry this annotation, so the operator doesn't inject the toleration. The taint keeps them off the reserved node groups, and they schedule on other available node groups instead.
Step 6: Apply the machine pool configuration
Create a machine pool configuration that describes the instance types and partitions the global resource scheduler should manage. The following example reserves four machines of type 7CPU-16GB and assigns priorities based on project:
# example-config.yaml
kind: ANYSCALE_MANAGED
machine_types:
  - machine_type: RES_7CPU_16GB
    launch_templates:
      - instance_type: "7CPU-16GB"
        market_type: ON_DEMAND
    partitions:
      - name: project-partition
        size: 4
rules:
  - selector: project in (high-prio)
    priority: 1
  - selector: project in (low-prio)
    priority: 0
    quota: 2
For the full set of configuration options, including rules and selectors, see Machine pool configuration file reference and Scheduling rules.
Apply the configuration:
anyscale machine-pool update --name reserved-nodes --spec-file example-config.yaml
Verify and use the machine pool
Run the following command to confirm the machine pool is ready:
anyscale machine-pool describe --name reserved-nodes
To run workloads against the machine pool, reference the machine type and pool name in your compute config. See Configure an Anyscale-managed machine pool.