Frequently Asked Questions
How do I use Tune Checkpointing with Anyscale Jobs?
We recommend using cloud storage (S3 or GCS) to synchronize Tune Checkpoints across Anyscale Jobs.
- Make sure you can access an S3 bucket (S3 guide) or GCS bucket (GCS guide) from your cluster.
- Set the upload_dir configuration in the Tune SyncConfig to point to your bucket. You should use an
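from ray import tune
from ray.air.config import RunConfig

upload_path = "s3://my-anyscale-bucket/tune-checkpoint/"
tuner = tune.Tuner(
    "PPO",  # This can be any Tune Trainable
    run_config=RunConfig(
        # SyncConfig controls where Tune uploads checkpoints and results
        sync_config=tune.SyncConfig(
            upload_dir=upload_path  # Set the upload path here
        )
    ),
)
result = tuner.fit()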
When your Anyscale Job retries, it will automatically load from the last checkpoint in the directory you specified. When your Anyscale Job completes, you can view the checkpoint directory in the output of the Job.
How do I make Anyscale Jobs wait for a cluster to obtain min_nodes before executing my workload?
In some scenarios, you may want your job to wait for some minimum number of nodes to be spun up before launching. If those nodes are not ready after a certain period of time, we will terminate the cluster or job.
We are working on building this in as a supported feature on the platform. Until then, here is a simple script that you can use to enable this behavior.
1. Download the wait-for-nodes.py script
Download the wait-for-nodes.py script to your working directory. This script takes two arguments:
- The first argument is the maximum number of seconds to wait before terminating the cluster.
- The second argument is optional and sets the minimum number of nodes to wait for before proceeding. If it is omitted, the script calculates the minimum as the sum of the min_nodes requested for each node type, plus 1 for the head node.
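For readers who want to see the idea behind the script, the following is a minimal sketch of the wait loop, not the actual wait-for-nodes.py. It assumes the job runs on a node of an existing Ray cluster and counts live nodes with ray.nodes(). The function names (default_min_nodes, wait_for_nodes, count_alive_ray_nodes) are hypothetical. Unlike the real script, this sketch requires the node count as an explicit second argument and only exits with a non-zero status on timeout; it does not terminate the cluster itself.

```python
import sys
import time


def default_min_nodes(min_nodes_per_type):
    """Sum of min_nodes over the worker node types, plus 1 for the head node."""
    return sum(min_nodes_per_type) + 1


def wait_for_nodes(count_nodes, min_nodes, timeout_s, poll_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll count_nodes() until it reports at least min_nodes.

    Returns True once the target is reached, or False if timeout_s
    seconds pass first. clock and sleep are injectable for testing.
    """
    deadline = clock() + timeout_s
    while True:
        if count_nodes() >= min_nodes:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def count_alive_ray_nodes():
    """Count the nodes Ray currently reports as alive (head node included)."""
    import ray  # imported lazily so the helpers above work without Ray
    return sum(1 for node in ray.nodes() if node["Alive"])


def main(argv):
    import ray
    timeout_s = float(argv[1])
    if len(argv) < 3:
        # Assumption: this sketch does not read the cluster's configured
        # min_nodes, so the node count has to be passed explicitly.
        print("usage: wait_sketch.py <timeout_seconds> <min_nodes>")
        return 2
    min_nodes = int(argv[2])
    ray.init(address="auto")  # connect to the already-running cluster
    if wait_for_nodes(count_alive_ray_nodes, min_nodes, timeout_s):
        return 0
    print(f"Timed out after {timeout_s}s waiting for {min_nodes} nodes")
    return 1  # non-zero exit stops a following `&& python my-app.py`


# Only run as a script when a timeout argument is supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

For example, a cluster with two worker node types requesting min_nodes of 2 and 3 would need default_min_nodes([2, 3]), that is 6 nodes, before the workload starts. Passing the node counter as a callable keeps the polling logic independent of Ray, so it can be exercised without a cluster.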
2. Update the entrypoint to use this script
- Production Jobs
entrypoint: python wait-for-nodes.py 300 && python my-app.py
- Ray Jobs
$ RAY_ADDRESS=anyscale://cluster-1 ray job submit --working-dir . -- bash -c "python wait-for-nodes.py 300 && python my-app.py"