Declarative compute configs
This page provides an overview of using the required_resources field for declarative compute configuration.
Instead of selecting from predefined Kubernetes pod shapes, use this syntax to specify the CPU, memory, GPU, and accelerator resources your workload needs. Anyscale provisions pods that satisfy those requirements for head and worker nodes in your Anyscale cluster.
This feature is in beta release for Anyscale cloud resources configured on the Kubernetes stack. Declarative compute configs are available through CLI and SDK only and require Anyscale CLI version 0.26.82 or later.
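For example, you can check the installed CLI version and upgrade it with pip (a minimal sketch using standard pip commands):

```shell
# Show the currently installed Anyscale CLI/SDK version.
pip show anyscale

# Upgrade to a version that supports declarative compute configs.
pip install -U "anyscale>=0.26.82"
```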
When to use declarative compute configs
With traditional compute configs, you select from a list of predefined instance types. On Kubernetes, these instance types are pod shapes that a Kubernetes admin defines in the Helm chart. On VMs, these are cloud provider instance types such as m5.2xlarge or n2-standard-8.
Declarative compute configs offer an alternative approach. Instead of selecting an instance type, you specify your resource requirements directly, as in the following example:
```yaml
worker_nodes:
- name: gpu-workers
  required_resources:
    CPU: 8
    memory: 32Gi
    GPU: 1
  required_labels:
    ray.io/accelerator-type: A100
```
This approach is useful when you want to:
- Request exactly the resources your workload needs.
- Target specific accelerators without managing instance type definitions.
- Avoid over-provisioning by specifying precise requirements.
How declarative compute configs work
- You specify the resources your workload needs using required_resources.
- Optionally, you target specific hardware using required_labels.
- Anyscale provisions nodes that satisfy those requirements.
- Autoscaling works as usual, adding nodes up to max_nodes as needed.
Each node group must use either instance_type or required_resources, not both. You can mix approaches in the same compute config. For example, you can use instance_type for the head node and required_resources for worker groups.
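For example, the following sketch keeps a predefined instance type for the head node while declaring resources for a worker group. The instance type name head-node-shape is a placeholder for whatever pod shape your Kubernetes admin defined in the Helm chart (or a cloud instance type on VMs):

```yaml
compute_config:
  head_node:
    instance_type: head-node-shape   # placeholder: a predefined pod shape or VM instance type
  worker_nodes:
  - name: on-demand-workers
    required_resources:              # declarative: Anyscale provisions pods that satisfy this
      CPU: 8
      memory: 32Gi
    min_nodes: 0
    max_nodes: 4
```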
Define resource requirements
Define resource requirements in your compute configuration using required_resources and, optionally, required_labels. For full compute config syntax, see Compute Config API Reference.
Anyscale support for TPUs requires cloud resources backed by GKE. See Leverage Cloud TPUs on GKE.
required_resources
Use required_resources to specify CPU, memory, and accelerator requirements for a node group.
| Resource | Type | Description | Example |
|---|---|---|---|
| CPU | Integer | Number of CPU cores. | CPU: 8 |
| memory | String | Memory with unit (Gi, Mi, G, M). | memory: 16Gi |
| GPU | Integer | Number of GPUs. | GPU: 1 |
| TPU | Integer | Number of TPU chips. | TPU: 4 |
| tpu_hosts | Integer | Number of TPU hosts. Required for TPU workloads. | tpu_hosts: 1 |
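As an illustration of the syntax, a single worker group can combine several of these fields. The values below are arbitrary and only show the expected format:

```yaml
worker_nodes:
- name: example-workers
  required_resources:
    CPU: 16          # integer count of CPU cores
    memory: 64Gi     # string with a unit: Gi, Mi, G, or M
    GPU: 2           # integer count of GPUs
  min_nodes: 0
  max_nodes: 3
```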
required_labels
Use required_labels to target specific hardware such as GPU types or TPU topologies.
| Label | Description | Example values |
|---|---|---|
| ray.io/accelerator-type | GPU or accelerator type. | T4, A10G, A100, L4, H100, TPU-V4, TPU-V5E, TPU-V6E |
| ray.io/tpu-topology | TPU topology. Required for TPU workloads. | 2x2, 2x4, 4x4 |
Examples
The following examples demonstrate declarative compute configs for common workloads.
- Basic CPU
- GPU
- Mixed CPU/GPU
- High memory
- Multi-GPU
- TPU
Request CPU and memory resources for a compute workload.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 4
      memory: 8Gi
  worker_nodes:
  - name: cpu-workers
    required_resources:
      CPU: 8
      memory: 16Gi
    min_nodes: 1
    max_nodes: 10
```
Target specific accelerator types using required_labels.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 4
      memory: 8Gi
  worker_nodes:
  - name: t4-workers
    required_resources:
      CPU: 7
      memory: 12Gi
      GPU: 1
    required_labels:
      ray.io/accelerator-type: T4
    min_nodes: 1
    max_nodes: 5
  - name: a100-workers
    required_resources:
      CPU: 15
      memory: 64Gi
      GPU: 1
    required_labels:
      ray.io/accelerator-type: A100
    min_nodes: 0
    max_nodes: 2
```
Combine CPU-only worker groups for preprocessing with GPU workers for training.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 4
      memory: 8Gi
  worker_nodes:
  - name: cpu-preprocessors
    required_resources:
      CPU: 16
      memory: 32Gi
    min_nodes: 2
    max_nodes: 10
  - name: gpu-trainers
    required_resources:
      CPU: 8
      memory: 32Gi
      GPU: 1
    required_labels:
      ray.io/accelerator-type: A10G
    min_nodes: 1
    max_nodes: 4
```
Configure high-memory workers for data processing workloads.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 2
      memory: 4Gi
  worker_nodes:
  - name: high-memory-workers
    required_resources:
      CPU: 32
      memory: 128Gi
    min_nodes: 4
    max_nodes: 20
```
Request multiple GPUs per worker for distributed training.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 8
      memory: 16Gi
  worker_nodes:
  - name: a100-workers
    required_resources:
      CPU: 30
      memory: 200Gi
      GPU: 4
    required_labels:
      ray.io/accelerator-type: A100
    min_nodes: 1
    max_nodes: 2
```
Configure TPU workers for JAX or TensorFlow workloads. Requires a GKE cluster with TPU node pools.
```yaml
compute_config:
  head_node:
    required_resources:
      CPU: 4
      memory: 16Gi
  worker_nodes:
  - name: tpu-v6e-2x2
    required_resources:
      CPU: 7
      memory: 12Gi
      TPU: 4
      tpu_hosts: 1
    required_labels:
      ray.io/accelerator-type: TPU-V6E
      ray.io/tpu-topology: 2x2
    min_nodes: 1
    max_nodes: 5
    market_type: SPOT
  - name: tpu-v6e-2x4
    required_resources:
      CPU: 14
      memory: 24Gi
      TPU: 8
      tpu_hosts: 1
    required_labels:
      ray.io/accelerator-type: TPU-V6E
      ray.io/tpu-topology: 2x4
    min_nodes: 0
    max_nodes: 2
    market_type: SPOT
```
You must set the tpu_hosts and ray.io/tpu-topology fields for TPU workloads. Ensure your project has sufficient TPU quota and matching node pools.
Kubernetes considerations
On Kubernetes, declarative compute configs bypass the instance types defined in the Helm chart for the Anyscale operator. You don't need to define instance types in the Helm chart for worker groups that use required_resources. Anyscale dynamically defines new pod shapes, eliminating the need to patch and redeploy the Helm chart.
Kubernetes schedules pods onto nodes that can satisfy the requested resources. Ensure your Kubernetes cluster has nodes with sufficient capacity and the required accelerators available.
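For example, you can confirm node capacity and labels with standard kubectl commands before relying on a new required_resources worker group (the node name is illustrative):

```shell
# List nodes with their labels to confirm accelerator-related labels exist.
kubectl get nodes --show-labels

# Inspect allocatable CPU, memory, and GPU capacity on a specific node.
kubectl describe node <node-name>
```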
For general Kubernetes compute config options such as tolerations, node selectors, and volume mounts, see Compute configuration options for Kubernetes.
Best practices
- Profile workloads to understand actual resource needs.
- Leave headroom for Ray overhead (CPU and memory); see the sketch after this list.
- Use meaningful worker group names for observability.
- Test autoscaling with small minimums first.
- Monitor real usage and refine resource specifications over time.
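For example, on a node with 8 CPUs and 16Gi of memory you might request slightly less than the full machine so Ray and system overhead still fit. The numbers below are only a sketch, not a tuned recommendation:

```yaml
worker_nodes:
- name: general-workers
  required_resources:
    CPU: 7         # leave roughly one core of headroom on an 8-CPU node
    memory: 12Gi   # leave a few Gi for Ray and system overhead on a 16Gi node
  min_nodes: 1     # start small and let autoscaling add nodes up to max_nodes
  max_nodes: 5
```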
Troubleshooting
Kubernetes Pods stuck in pending
- Ensure cluster nodes can satisfy requested resources (the commands after this list show how to inspect a pending pod).
- Verify required labels match node labels.
- Confirm requests don't exceed single-node capacity.
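The scheduler's reason for leaving a pod in Pending usually appears in the pod's events. For example (the pod name is illustrative):

```shell
# List pods that are still Pending in the current namespace.
kubectl get pods --field-selector=status.phase=Pending

# The Events section explains why the scheduler can't place the pod,
# such as insufficient CPU/memory or unsatisfied node labels.
kubectl describe pod <pod-name>
```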
Pods not using expected hardware
- Verify accelerator labels on nodes.
- Check label spelling and capitalization.
TPU pods not starting
- Confirm you've set tpu_hosts and ray.io/tpu-topology.
- Ensure the TPU count matches the topology. For example, a 2x2 topology has 4 chips, so set TPU: 4.
- Verify TPU quota and node pool configuration.