Frequently Asked Questions

How do I use Tune Checkpointing with Anyscale Jobs?

We recommend using cloud storage (S3 or GCS) to synchronize Tune Checkpoints across Anyscale Jobs.

  1. Make sure you can access an S3 bucket (S3 guide) or GCS bucket (GCS guide) from your cluster.
  2. Set the upload_dir option in the Tune SyncConfig to point to your bucket, using an s3:// or gs:// URL.
tune_example.py
import anyscale
from ray import tune
from ray.air.config import RunConfig

upload_path = "s3://my-anyscale-bucket/tune-checkpoint/"

tuner = tune.Tuner(
    "PPO",  # This can be any Tune Trainable
    run_config=RunConfig(
        name="experiment_name",
        sync_config=tune.SyncConfig(
            upload_dir=upload_path,  # Set the upload path here
        ),
    ),
)

result = tuner.fit()
anyscale.job.output({
    "tune_upload_path": upload_path,
})

When your Anyscale Job retries, it will automatically load from the last checkpoint in the directory you specified. When your Anyscale Job completes, you can view the checkpoint directory in the output of the Job.

For more information about Tune Checkpoints, view the Tune examples. For more information about Anyscale Job outputs, you can view the reference.

How do I make Anyscale Jobs wait for a cluster to obtain min_nodes before executing my workload?

In some scenarios, you may want your job to wait until a minimum number of nodes has been spun up before it launches your workload. If those nodes are not ready within a timeout that you choose, the cluster or job is terminated.

We are working on building this in as a supported feature on the platform. Until then, here is a simple script that you can use to enable this behavior.

Instructions

1. Download the wait-for-nodes.py script

Download the wait-for-nodes.py script to your working directory. This script takes two arguments:

  1. The first argument is the maximum number of seconds to wait before terminating the cluster.
  2. The second argument is optional and sets the minimum number of nodes to wait for before proceeding. If it is omitted, the script calculates the minimum as the sum of the min_nodes requested for each node type, plus 1 for the head node.
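To make the two arguments concrete, here is a minimal sketch of the waiting logic. It is not the actual wait-for-nodes.py script: the function names (default_min_nodes, wait_for_nodes, main) are hypothetical, this version requires the node count to be passed explicitly, and it only returns a non-zero exit code on timeout rather than terminating the cluster. It assumes that ray.nodes() reports one entry per node with an "Alive" flag.

```python
import sys
import time
from typing import Callable, Sequence


def default_min_nodes(worker_min_nodes: Sequence[int]) -> int:
    """Sum the min_nodes of each worker node type and add 1 for the head node."""
    return sum(worker_min_nodes) + 1


def wait_for_nodes(
    count_alive: Callable[[], int],
    min_nodes: int,
    timeout_s: float,
    poll_interval_s: float = 5.0,
) -> bool:
    """Poll count_alive() until it reports at least min_nodes or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if count_alive() >= min_nodes:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


def main(argv: Sequence[str]) -> int:
    # Imported here so the helpers above can be used without Ray installed.
    import ray

    timeout_s = float(argv[1])  # first argument: seconds to wait
    min_nodes = int(argv[2])    # second argument: required in this sketch

    ray.init(address="auto")    # connect to the already running cluster

    def count_alive() -> int:
        # Assumption: each entry returned by ray.nodes() has an "Alive" flag.
        return sum(1 for node in ray.nodes() if node["Alive"])

    # A non-zero return value is meant to be used as the process exit code.
    return 0 if wait_for_nodes(count_alive, min_nodes, timeout_s) else 1
```

If main is wired up with sys.exit(main(sys.argv)), a timeout produces a non-zero exit status, so a workload chained after the script with && would not start.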

2. Update the entry point to use this script

Update the entry point for your production job or your Ray job to run the wait-for-nodes.py script first.

job.yaml
entrypoint: python wait-for-nodes.py 300 && python my-app.py
working_dir: .
upload_path: s3://my-upload-bucket