In some scenarios, you may want your job to wait until a minimum number of nodes have been spun up before it launches. If those nodes are not ready within a given period of time, the cluster or job is terminated.
We are working on building this in as a supported feature on the platform. Until then, here is a simple script that you can use to enable this behavior.
1. Download the wait-for-nodes.py script
Download the wait-for-nodes.py script to your working directory. This script takes two arguments:
- The first argument is the maximum number of seconds to wait before terminating the cluster.
- The second argument is optional and sets the minimum number of nodes to wait for before proceeding. If it is omitted, the script calculates the minimum as the sum of the min_nodes requested for each node type, plus 1 for the head node.
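The script itself is not reproduced here. As a rough illustration only, a script with this two-argument interface could be sketched as below. Every function name, the polling interval, and the min_nodes_per_type mapping are hypothetical, and the sketch signals a timeout through a non-zero exit code rather than terminating the cluster itself, so it should not be read as the actual wait-for-nodes.py.

```python
import time
from typing import Callable, Dict, Optional, Sequence


def default_min_nodes(min_nodes_per_type: Dict[str, int]) -> int:
    """Sum the min_nodes of every node type and add 1 for the head node."""
    return sum(min_nodes_per_type.values()) + 1


def wait_for_nodes(
    count_alive_nodes: Callable[[], int],
    min_nodes: int,
    timeout_s: float,
    poll_interval_s: float = 5.0,  # assumed value, not from the real script
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least min_nodes are alive; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if count_alive_nodes() >= min_nodes:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def ray_alive_node_count() -> int:
    """Count alive nodes in the connected Ray cluster (requires ray)."""
    import ray  # imported lazily so the helpers above work without Ray

    if not ray.is_initialized():
        ray.init(address="auto")
    return sum(1 for node in ray.nodes() if node["Alive"])


def main(
    argv: Sequence[str],
    min_nodes_per_type: Optional[Dict[str, int]] = None,  # hypothetical input
    count_alive_nodes: Callable[[], int] = ray_alive_node_count,
) -> int:
    """argv[0]: max seconds to wait; argv[1] (optional): minimum node count."""
    timeout_s = float(argv[0])
    if len(argv) > 1:
        min_nodes = int(argv[1])
    else:
        # With no mapping supplied, this falls back to the head node only.
        min_nodes = default_min_nodes(min_nodes_per_type or {})
    ready = wait_for_nodes(count_alive_nodes, min_nodes, timeout_s)
    return 0 if ready else 1  # non-zero exit signals the timeout
```

To run the sketch as a script, it would end with sys.exit(main(sys.argv[1:])) after importing sys. Returning a non-zero exit code on timeout means a command chained after it with && is not run.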
2. Update the entrypoint to use this script
Example for production jobs:
entrypoint: python wait-for-nodes.py 300 && python my-app.py
Example for a Ray job:
$ RAY_ADDRESS=anyscale://cluster-1 ray job submit --working-dir . -- python wait-for-nodes.py 300 \&\& python my-app.py