Scale uv dependency installations on large Anyscale clusters

This article describes how to avoid contention and registration failures when uv installs Python dependencies on large Ray clusters. Apply these techniques when your cluster has many nodes, each node has many CPUs, or both. For general guidance on using uv with Anyscale, see Use uv to manage Python dependencies.

Symptoms

On large clusters, you might observe one or more of the following behaviors during worker startup:

  • Slow startup as many workers redownload and reinstall the same packages in parallel.
  • Race conditions, file corruption, or partial installations in the shared uv cache or temporary directory.
  • Worker process registration timeouts.
  • Ray Train actor placement failures when workers can't register before the default timeout.

Cause

The Ray uv runtime environment plugin initializes a separate uv run for each Ray worker process. When a cluster has many workers, this design produces the following problems:

  • Workers trigger parallel downloads and installations of the same packages, wasting bandwidth and CPU cycles.
  • uv uses a shared cache and temporary directory on each node. Concurrent writes cause race conditions and file corruption.
  • Worker registration exceeds the default timeout while packages download and install.
  • Ray Train's internal actor placement timeout expires before workers finish registering.

Solutions

The following approaches address different aspects of the scaling problem. You can apply them independently or combine them.

Use a custom py_executable to install once per node

Configure a custom py_executable that designates one worker per node to download and install packages. Subsequent workers reuse the prepared environment. This eliminates duplicate installations and reduces contention on the shared cache.

The following Anyscale job configuration sets py_executable to a bash script that coordinates installation with a file lock:

job_config = anyscale.job.JobConfig(
    name=job_name,
    image_uri="anyscale/image/uv-flock:1",
    env_vars={
        "UV_PROJECT_ENVIRONMENT": "/home/ray/anaconda3",
        "RAY_ENABLE_UV_RUN_RUNTIME_ENV": "0",
    },
    entrypoint="uv run --inexact --verbose --frozen main.py",
    compute_config=compute_config,
    py_executable="bash ./puv.sh",
)

Place a puv.sh script in your working directory with the following contents:

#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="/tmp/uv_sync.lock"
DONE_FILE="/tmp/uv_sync.done"

export UV_PROJECT_ENVIRONMENT="/home/ray/anaconda3"

(
  flock 9
  if [ ! -f "$DONE_FILE" ]; then
    echo "[INFO] First worker running: uv sync $*"
    uv sync --verbose --inexact
    echo "[INFO] Sync complete, marking done."
    touch "$DONE_FILE"
  fi
) 9>"$LOCK_FILE"

# If not the first, wait until the first is done.
while [ ! -f "$DONE_FILE" ]; do
  echo "[INFO] Waiting for uv sync to complete..."
  sleep 1
done

echo "[INFO] Sync done. Running: uv run $*"
uv run --verbose --inexact "$@"

The script uses flock to serialize access. The first worker to acquire the lock runs uv sync and writes a marker file. Other workers wait until the marker appears, then run their command against the prepared environment.
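For illustration only, the same first-writer-wins pattern can be sketched in Python with fcntl.flock. The function and argument names here are hypothetical, not part of the script above:

```python
import fcntl
import os

def run_once(lock_path, done_path, install):
    """Run install() exactly once across concurrent callers on one node."""
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # block until the lock is free
        try:
            if not os.path.exists(done_path):
                install()  # only the first caller reaches this branch
                open(done_path, "w").close()  # marker: environment is ready
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Later callers find the marker file and skip the install, mirroring the DONE_FILE check in puv.sh.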

Mount a high-speed shared uv cache

For clusters with many nodes, point uv at a centralized cache on a high-performance shared file system such as Amazon FSx for OpenZFS. A shared cache reduces each package to a single download across the cluster and serves subsequent installations from cached artifacts over high-throughput storage.

Use an init script to mount the shared file system on every node before Ray starts, then configure uv to use the mounted path as its cache directory. A shared cache keeps environments consistent across hundreds of workers and minimizes network overhead.
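As a sketch, an init script for this setup might look like the following. The mount point, file system DNS name, and NFS options are placeholders to adapt for your deployment; uv reads its cache location from the UV_CACHE_DIR environment variable:

```shell
#!/usr/bin/env bash
# Sketch of a node init script: mount an FSx for OpenZFS volume over NFS
# and point uv at the shared cache. The mount point and DNS name below
# are placeholders for your own deployment.
set -euo pipefail

MOUNT_POINT="/mnt/shared-uv-cache"
FSX_DNS="fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com"

sudo mkdir -p "$MOUNT_POINT"
sudo mount -t nfs -o nfsvers=4.2 "$FSX_DNS:/fsx/" "$MOUNT_POINT"

# uv honors UV_CACHE_DIR; export it so Ray worker processes inherit it.
echo "export UV_CACHE_DIR=$MOUNT_POINT/uv" >> /home/ray/.bashrc
```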

Extend worker registration and start timeouts

uv package installations can exceed the default worker registration timeout. Two timeouts apply:

  • RAY_worker_register_timeout_seconds controls how long Ray waits for a worker process to register.
  • RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S controls how long Ray Train waits for actor placement. The default is 30 seconds.

To extend the worker registration timeout, set the following environment variable on your cluster:

RAY_worker_register_timeout_seconds=600
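On Anyscale, one way to apply this cluster-wide is through the job configuration's env_vars, following the same pattern as the job config shown earlier. This is a sketch; the name and entrypoint values are placeholders:

```python
import anyscale

# Sketch: apply the registration timeout cluster-wide through job env vars.
# The name and entrypoint are placeholders.
job_config = anyscale.job.JobConfig(
    name="large-uv-job",
    entrypoint="uv run main.py",
    env_vars={"RAY_worker_register_timeout_seconds": "600"},
)
```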

To extend the Ray Train actor placement timeout, set the following environment variable in your training script:

import os
os.environ["RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S"] = "600"

Caution

Setting these timeouts too high delays task recovery when workers fail. Choose a value that covers your longest expected install without masking real failures.