Bring your own Docker environments

Anyscale cluster environments can be configured to launch with user-specified Docker images. This can be useful to:

  • Build images with dependencies and packages that aren't publicly available.
  • Keep images within your organization's account.
  • Leverage your existing CI/CD pipelines to build and manage Anyscale cluster environments.

The following diagram illustrates the relationship between the Anyscale control plane and your organization's data plane in a CI/CD pipeline:


Getting started

Prerequisites

  • A local installation of Docker (for building and pushing images).
  • (Optional) Anyscale CLI version 0.5.50 or higher, if you want to use the CLI to create cluster environments.
  • (Optional) Amazon ECR access set up, if you want to access images stored in a private ECR repository.
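If you plan to use the CLI, you can verify your local setup with something like the following minimal sketch (the version pin mirrors the prerequisite above):

docker --version
pip install -U "anyscale>=0.5.50"
anyscale --version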

Step 1: Build an Anyscale-compatible image

Anyscale provides public base images pre-installed with all the dependencies needed to run Ray on Anyscale, for example anyscale/ray:2.9.3. A full list of base images and their dependencies can be found here. Once you've selected a base image, create a Dockerfile that layers your additional dependencies on top of it:

# Use Anyscale base image
FROM anyscale/ray:2.9.3-py310

# Add extra dependencies
ARG DEBIAN_FRONTEND=noninteractive
RUN sudo apt-get update && sudo apt-get install -y axel nfs-common zip unzip awscli && sudo apt-get clean

RUN pip install --no-cache-dir -U sympy

# (Optional) Verify that dependencies from the base image still work. This
# is useful for catching dependency conflicts at build time.
RUN echo "Testing Ray Import..." && python -c "import ray"
RUN ray --version
RUN jupyter --version
RUN anyscale --version
RUN sudo supervisord --version

Once you've created your Dockerfile, build and tag the image with:

docker build -t <your-registry>/<your-image>:<your-tag> .
info

The Anyscale base images come with a default entrypoint set. Overwriting this entrypoint may break the Web Terminal and Jupyter notebook server when you launch your cluster. See this section for details on bypassing this entrypoint when running the image locally.

info

If your image is based on an image with Ray version 2.7.X or lower, see this section for details about apt-get update failures caused by the legacy Kubernetes apt repository.

Step 2: Push your image

Push your image to a Docker registry. The following registries are currently supported:

  • Any publicly accessible registry, for example Docker Hub with no authentication.
  • Private cloud-provider-managed registries, such as Amazon ECR and Artifact Registry.
  • Private third-party registries (Docker Hub, JFrog Artifactory, etc.). See this guide for setting up access to third-party registries.

For details on pushing images to Amazon ECR or Artifact Registry, see your cloud provider's documentation.
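For example, if you host images in a private Amazon ECR repository, the tag-and-push flow typically looks like the following sketch (the account ID, region, and repository name are placeholders to replace with your own):

# Authenticate Docker with your ECR registry (placeholder account ID and region).
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com

# Tag the locally built image with the full repository URI, then push it.
docker tag my-image:latest 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest
docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest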

Step 3: Create a cluster environment for your image

Create a YAML configuration file like the following:

docker_image: my-registry/my-image:tag
ray_version: 2.9.3 # Replace this with the version of Ray in your image
env_vars: # Optionally, specify environment variables
  MY_VAR: value
registry_login_secret: mysecretid # Optional, only needed for private third-party registries

Then, run the following:

anyscale cluster-env build -n <cluster-env-name> my_cluster_env.yaml
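For example, tying this to the hypothetical ECR image pushed in Step 2, the whole step might look like the following sketch (the image URI and environment name are placeholders):

cat > my_cluster_env.yaml <<'EOF'
docker_image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-image:latest
ray_version: 2.9.3
EOF

anyscale cluster-env build -n my-custom-env my_cluster_env.yaml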

Step 4: Launch a workload with your image

Once you've created a cluster environment with your image, you can reference the environment when starting clusters or other workloads in your account.

Clusters

anyscale cluster start --env=<cluster-env-name>

Jobs and Services

You can specify your custom Docker environment in the cluster_env field of the YAML configuration for your jobs and services, as in the sketch below.
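For instance, a job configuration referencing the cluster environment from Step 3 might look like the following sketch (the entrypoint and compute config names are hypothetical, and the exact set of supported fields depends on your CLI version):

entrypoint: python my_script.py # hypothetical entrypoint
cluster_env: my-custom-env # the cluster environment built in Step 3
compute_config: my-compute-config # hypothetical compute configuration name

You can then submit the job with anyscale job submit my_job.yaml.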

Advanced

Init Scripts (Public Beta)

An init script is a shell script that runs inside the Ray container on all nodes before Ray starts. Common use cases include:

  • Running commands to fetch resources or other runtime dependencies.
  • Installing container-based monitoring or security agents.
  • Pre-job testing and verification for complex health checks (for example, verifying network paths before starting jobs).

To add init scripts to your Docker image, write them into /anyscale/init when you build your image.
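For example, a Dockerfile layer that adds a hypothetical init script named fetch_assets.sh might look like the following sketch (the script name and contents are placeholders):

FROM anyscale/ray:2.9.3-py310

# Copy the init script into the directory that Anyscale runs init scripts from.
COPY fetch_assets.sh /anyscale/init/fetch_assets.sh
# Make the script executable; sudo may be needed depending on file ownership in the base image.
RUN sudo chmod +x /anyscale/init/fetch_assets.sh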

All output from init scripts is written to /tmp/ray/startup-actions.log. If an init script fails to execute on a node, its standard output and standard error are shown in the Event Log for the cluster associated with your Job/Service/Workspace, and the node is terminated.

Troubleshooting

Debugging cluster startup failures

To troubleshoot clusters that won't start up, start by looking in the cluster's Event Log for any helpful tips.

Debugging Ray container utilities (Jupyter, VS Code, Web Terminal)

To troubleshoot issues with utilities that are run inside of the Ray container, the following log files may be useful:

  • /tmp/ray/jupyter.log - Jupyter log
  • /tmp/ray/vscode.log - VS Code log
  • /tmp/ray/web_terminal_server.log - Web Terminal system log

If you are unable to access these log files through the Web Terminal, they are also accessible by downloading Ray logs for the cluster:

anyscale logs cluster --id [CLUSTER_ID] --download

Running the image locally

When running docker run -it <your-image> locally, you may see an error similar to the following:

Error: Format string '/home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy-environment %(ENV_ANYSCALE_DEPLOY_ENVIRONMENT)s --cli-token %(ENV_ANYSCALE_CLI_TOKEN)s --host %(ENV_ANYSCALE_HOST)s --working-dir %(ENV_ANYSCALE_WORKING_DIR)s --session-id %(ENV_ANYSCALE_SESSION_ID)s' for 'program:web_terminal_server.command' contains names ('ENV_ANYSCALE_DEPLOY_ENVIRONMENT') which cannot be expanded. Available names: ENV_BUILD_DATE, ENV_HOME, ENV_HOSTNAME, ENV_LANG, ENV_LC_ALL, ENV_LOGNAME, ENV_PATH, ENV_PWD, ENV_PYTHONUSERBASE, ENV_RAY_USAGE_STATS_ENABLED, ENV_RAY_USAGE_STATS_PROMPT_ENABLED, ENV_RAY_USAGE_STATS_SOURCE, ENV_SHELL, ENV_SUDO_COMMAND, ENV_SUDO_GID, ENV_SUDO_UID, ENV_SUDO_USER, ENV_TERM, ENV_TZ, ENV_USER, group_name, here, host_node_name, process_num, program_name in section 'program:web_terminal_server' (file: '/etc/supervisor/conf.d/supervisord.conf')

This is caused by Anyscale's custom entrypoint, which requires certain environment variables to be set. To work around this, manually override the entrypoint when running the image locally:

docker run -it --entrypoint bash <your-image>

This will give you an interactive shell into the image locally.
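Once inside the shell, you can sanity-check your image's dependencies much like the build-time checks in Step 1, for example:

python -c "import ray; print(ray.__version__)"
pip show sympy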

Docker write: no space left on device

If you’re pulling a large image, you may run out of disk space on your nodes. You can work around this by configuring a larger volume in your compute config’s advanced options:

  1. Navigate to Configurations->Cluster compute configs in the Anyscale console.
  2. Select "Create new config".
  3. Pick a name for your compute config, and set Cloud name to the cloud you want to launch your workload in.
  4. Navigate to the Advanced configuration section near the bottom of the page.
  5. Add the following configuration to the Advanced configuration setting to attach a 250 GB volume (you can tune this to an appropriate size for your image):

{
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "VolumeSize": 250,
        "DeleteOnTermination": true
      }
    }
  ]
}

Note that "DeleteOnTermination" should be set to true to clean up the volume after the instance is terminated.

Installing stable versions of Ray on top of nightly CUDA images

Older versions of Ray may not have base images available for newer versions of CUDA. In this scenario, you can use the nightly base images and reinstall a stable version of Ray on top of the nightly image. For example, to use CUDA 12.1 with Ray 2.5.0, you can create a Dockerfile similar to the following:

FROM anyscale/ray:nightly-py310-cu121

RUN pip uninstall -y ray && pip install -U ray==2.5.0

If the version of CUDA you need isn't already supported in the nightly images, contact support.

docker: Error response from daemon: no basic auth credentials.

info

This section assumes that Anyscale nodes are launched into your account with the <cloud-id>-cluster_node_role role. If your nodes are launched with ray-autoscaler-v1, or if you are using a custom AWS IAM role, you can apply the same steps to that role instead to grant ECR access.

This error can happen if the nodes launched in your account don't have permission to pull the image you specified. If you're using Amazon ECR to host your images, check that you've completed the Amazon ECR access setup steps. In particular, make sure that:

  • The <cloud-id>-cluster_node_role role has the AmazonEC2ContainerRegistryReadOnly policy attached.
  • The private ECR repository allows pulls from nodes with the <cloud-id>-cluster_node_role role. This is necessary if the private ECR repository is in a separate account from your EC2 instances.
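For example, one way to attach the read-only ECR policy to the node role with the AWS CLI (a minimal sketch; substitute your actual cloud ID in the role name):

aws iam attach-role-policy \
  --role-name <cloud-id>-cluster_node_role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly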