Troubleshoot slow cluster startup

Cluster startup on the Kubernetes compute stack can be slow when large container images need to be pulled to new nodes. Best-case startup times are similar to the virtual machine stack, but worst-case times can exceed 10 minutes without clear progress indicators.

This guide covers the most common cause, slow image pulls, and its solutions.

Slow image pull

Large container images (10 GB or larger) slow pod startup because the kubelet must pull a significant amount of data before the container can start. The problem is worse when workload pods are scheduled on different nodes, since each new node starts with an empty local image cache and must pull the full image itself. This pattern is common with aggressive autoscalers such as Karpenter and with workloads that require GPU or TPU nodes.
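To confirm that image pulls are the bottleneck, it can help to extract pull durations from the kubelet's "Pulled" events. The following is a minimal sketch, not an official tool. It assumes event messages shaped like `Successfully pulled image "..." in 2m3.5s (...)`, which may vary across Kubernetes versions. The function names and the 60-second threshold are hypothetical.

```python
import re

# Assumed kubelet event message shape (may differ across Kubernetes versions):
#   Successfully pulled image "repo/img:tag" in 2m3.5s (2m4s including waiting)
_PULLED = re.compile(r'Successfully pulled image "([^"]+)" in (\S+)')

# Go-style duration units mapped to seconds.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units are listed first so "ms" is not read as "m".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")


def go_duration_seconds(text):
    """Convert a Go-style duration such as '2m3.5s' or '850ms' to seconds."""
    parts = _PART.findall(text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)


def slow_pulls(messages, threshold_seconds=60.0):
    """Return (image, seconds) for each pull lasting at least threshold_seconds."""
    results = []
    for message in messages:
        match = _PULLED.search(message)
        if match is None:
            continue  # not a completed-pull message
        # Drop trailing punctuation in case the duration ends a sentence.
        seconds = go_duration_seconds(match.group(2).rstrip(".,"))
        if seconds >= threshold_seconds:
            results.append((match.group(1), seconds))
    return results
```

The message strings could come from the `message` field of events listed with `kubectl get events --field-selector reason=Pulled -o json`. Pulls that repeatedly take minutes on newly provisioned nodes suggest that the solutions in this guide apply.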

Use image streaming

Image streaming lazy-loads large images by pulling only the minimum amount of data before startup and providing access to the rest at runtime through filesystem mounts. In testing, image streaming reduced image pull time from approximately 3 minutes to 25 seconds.

Image streaming support varies by cloud provider:

Google Kubernetes Engine (GKE): Host images in Artifact Registry. Image streaming is automatic.
Azure Kubernetes Service (AKS): Host images in Azure Container Registry (ACR). Image streaming is automatic.
Amazon EKS and other Kubernetes setups: Use the SOCI snapshotter plugin for containerd.

Note

Image streaming requires hosting images in a cloud provider registry that supports it. Images hosted in third-party or self-hosted registries may not support image streaming.
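As a quick check against the provider list above, the following sketch classifies an image reference by its registry host. It is illustrative only. The hostname patterns (`*-docker.pkg.dev` for Artifact Registry and `*.azurecr.io` for ACR) and the function names are assumptions, so confirm them against your provider's documentation. A result of `None` means only that the host matched neither pattern, in which case the SOCI snapshotter route or the note above may apply.

```python
import re

# Assumed hostname patterns for the two registries named in this guide.
_REGISTRY_PATTERNS = [
    (re.compile(r"^[a-z0-9-]+-docker\.pkg\.dev$"), "Artifact Registry"),
    (re.compile(r"^[a-z0-9]+\.azurecr\.io$"), "Azure Container Registry"),
]


def registry_host(image):
    """Return the registry host of an image reference (docker.io if implicit)."""
    first, sep, _ = image.partition("/")
    # By Docker convention, the first path component names a registry only if
    # it contains a dot or a port, or is exactly "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def streaming_registry(image):
    """Return the matching registry name, or None if no pattern matches."""
    host = registry_host(image).split(":")[0].lower()  # drop any port
    for pattern, name in _REGISTRY_PATTERNS:
        if pattern.match(host):
            return name
    return None
```

For example, `streaming_registry("us-central1-docker.pkg.dev/my-project/my-repo/ray:latest")` returns `"Artifact Registry"`, while an image on a self-hosted registry returns `None`. The project and repository names here are placeholders.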

Anyscale recommends enabling image streaming for most Kubernetes deployments.

Use P2P file sharing

Peer-to-peer (P2P) file sharing tools such as Dragonfly and Kraken increase image download throughput. Nodes pull from in-cluster peers rather than from the source registry, which particularly helps when new nodes spin up frequently with GPU workloads or aggressive autoscaling.

P2P file sharing requires additional infrastructure setup and management on your cluster. Evaluate whether the performance benefit justifies the operational overhead for your environment.

If startup consistently exceeds 10 minutes

If cluster startup consistently takes longer than 10 minutes and you've already optimized image pull times, contact Anyscale support with your cluster ID and startup time metrics.
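One way to gather startup time metrics is to measure how long each pod took to become ready. The sketch below is an illustration under assumptions, not an Anyscale-provided tool. It uses pod readiness as a proxy for cluster startup, reads a pod object as returned by `kubectl get pod <name> -o json`, and relies on the standard `metadata.creationTimestamp` and `status.conditions` fields. The function names are hypothetical.

```python
from datetime import datetime, timezone


def _parse_k8s_time(value):
    # Kubernetes serializes timestamps as RFC 3339 in UTC,
    # for example 2024-05-01T12:00:00Z.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_until_ready(pod):
    """Return seconds from pod creation to its Ready transition, or None."""
    created = _parse_k8s_time(pod["metadata"]["creationTimestamp"])
    for condition in pod.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready" and condition.get("status") == "True":
            ready = _parse_k8s_time(condition["lastTransitionTime"])
            return (ready - created).total_seconds()
    return None  # the pod is not ready yet
```

Because `lastTransitionTime` records the most recent transition, a pod that lost and regained readiness reports a longer value than its first startup. Treat the result as an approximation.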
