Running an Anyscale Job only when min_nodes is reached

In some scenarios, you may want your job to wait until a minimum number of nodes has been spun up before it launches. If those nodes are not ready within a set period of time, the cluster or job is terminated.

We are working on building this in as a supported feature on the platform. Until then, here is a simple script that you can use to enable this behavior.

Instructions

1. Download the wait-for-nodes.py script

Download the wait-for-nodes.py script to your working directory. This script takes two arguments:

  1. The first argument is the maximum number of seconds to wait before terminating the cluster.
  2. The second argument is optional and sets the minimum number of nodes to wait for before proceeding. If it is omitted, the script calculates the minimum as the sum of the min_nodes requested for each node type, plus 1 for the head node.
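
The contents of wait-for-nodes.py are not reproduced on this page, so the following is only a minimal sketch of the behavior described above, not the actual script. The function names (default_min_nodes, wait_for_nodes, count_alive_ray_nodes), the 5-second poll interval, and the node-type names in the usage note are all assumptions made for illustration. The sketch only reports success or failure; how the real script terminates the cluster is not shown here.

```python
import time
from typing import Callable, Mapping


def default_min_nodes(min_nodes_by_type: Mapping[str, int]) -> int:
    """Sum min_nodes over the worker node types, plus 1 for the head node."""
    return sum(min_nodes_by_type.values()) + 1


def wait_for_nodes(
    count_nodes: Callable[[], int],
    min_nodes: int,
    timeout_s: float,
    poll_interval_s: float = 5.0,  # assumed interval, not taken from the real script
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll count_nodes() until it reaches min_nodes or timeout_s elapses.

    Returns True if enough nodes came up in time, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_nodes() >= min_nodes:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def count_alive_ray_nodes() -> int:
    """Count nodes that Ray reports as alive (head node included).

    Assumes ray.init() has already connected to the running cluster.
    Ray is imported lazily so the helpers above work without it installed.
    """
    import ray

    return sum(1 for node in ray.nodes() if node.get("Alive"))
```

With hypothetical node types, default_min_nodes({"cpu-worker": 2, "gpu-worker": 1}) gives 4: three workers plus the head node. A wrapper script could call wait_for_nodes(count_alive_ray_nodes, 4, 300) and exit with a non-zero status when it returns False.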

2. Update the entry point to use this script

Update the entry point for your production job or your Ray job to run the wait-for-nodes.py script first.

Example for production jobs:

job.yaml:

entrypoint: python wait-for-nodes.py 300 && python my-app.py
working_dir: .
upload_path: s3://my-upload-bucket

Example for a Ray job:

$ RAY_ADDRESS=anyscale://cluster-1 ray job submit --working-dir . -- python wait-for-nodes.py 300 \&\& python my-app.py
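
Both entry points chain the two commands with the shell's && operator, so my-app.py starts only if wait-for-nodes.py exits with status 0. As a rough illustration of that control flow, here is a Python equivalent. It assumes the script signals a timeout through a non-zero exit code, which this page implies but does not state, and run_chained is a hypothetical helper name.

```python
import subprocess
import sys
from typing import List, Sequence


def run_chained(commands: Sequence[List[str]]) -> int:
    """Run commands in order and stop at the first non-zero exit code, like `&&`."""
    for cmd in commands:
        code = subprocess.run(cmd).returncode
        if code != 0:
            # A failed step short-circuits the chain; later commands never start.
            return code
    return 0


# Hypothetical usage mirroring the entry points above:
# run_chained([
#     [sys.executable, "wait-for-nodes.py", "300"],
#     [sys.executable, "my-app.py"],
# ])
```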