
Create and manage Anyscale Jobs

note

Use of Anyscale Jobs requires Ray 2.0+.

Anyscale Jobs are discrete workloads managed by Anyscale. A developer designs and packages code for a Job, then submits the Job to Anyscale for execution and cluster lifecycle management. Jobs are best suited for workflows where you want Anyscale to handle starting the cluster and reacting to failures.

Lifecycle of an Anyscale Job:

  1. Creates a cluster
  2. Runs the Ray Job on it
  3. Restarts the Ray Job on failure (up to max_retries)
  4. Records the output and sends an email on success or failure.

Define Anyscale Jobs

Jobs need to be defined in a YAML file like this:

name: "my-first-job"
project_id: "prj_7S7Os7XBvO6vdiVC1J0lgj"
compute_config: my-compute-config # You may specify `compute_config_id` or `cloud` instead
# Alternatively, a one-off compute config
# compute_config:
# cloud_id: cld_4F7k8814aZzGG8TNUGPKnc
# region: us-west-2
# head_node_type:
# name: head
# instance_type: m5.large
# worker_node_types: []
cluster_env: my-cluster-env:5 # You may specify `build_id` instead
runtime_env:
working_dir: "s3://my_bucket/my_job_files.zip"
# You may also specify other runtime environment properties like `pip` and `env_vars`
# pip: "./requirements.txt" # relative path to the local directory where `anyscale job submit` is run
# pip:
# - pandas
# - torch
# env_vars:
# SECRETE: "xyz"
entrypoint: "python my_job_script.py --option1=value1"
max_retries: 3

Here are the fields you can provide in the YAML file:

  • compute_config A Compute Config for the cluster the Job will run on.
    • On the SDK, this can be specified as compute_config_id or compute_config (a one-off compute config). This is required, and only one of these fields can be specified.
    • On the CLI, you may instead specify compute_config (the name of a cluster compute config, or a one-off config) or cloud (the name of an Anyscale cloud) for convenience. Both attributes are optional; if neither is specified, the Job uses a default compute config associated with the default cloud.
  • (Required) cluster_env A Cluster Environment for the cluster the Job will run on.
    • On the SDK, this can only be specified as build_id.
    • On the CLI, you may specify cluster_env (the name and version of the cluster environment, colon-separated; if you don't specify a version, the latest is used). This attribute is optional in the CLI. The SDK sketch after this list shows how to resolve a cluster_env name (or a missing value) into a build_id.
  • (Optional) runtime_env A runtime environment containing your application code and dependencies.
  • (Required) entrypoint The entrypoint command that will be run on the cluster to run the Job. Although it is generally a Python script, the entrypoint can be any shell command.
  • (Optional) max_retries The maximum number of retries before the Job is considered failed (defaults to 5).
  • (Optional) name Name of the Job. Job names don't have to be unique, either within a project or across projects.
  • (Optional) project_id The id of the Project you want the Job to run in. You can find project_id in the URL by navigating to the project in the Anyscale Console. If not specified, the Job won't belong to any Project.
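
For reference, here is how a cluster_env name can be resolved into a build_id with the Python SDK. This is a minimal sketch: the method names (search_cluster_environments, list_cluster_environment_builds), the query models, and the revision attribute are taken from the SDK reference and may differ across versions, so verify them against the reference for your SDK.

from typing import Optional

from anyscale import AnyscaleSDK
from anyscale.sdk.anyscale_client.models import ClusterEnvironmentsQuery, TextQuery

sdk = AnyscaleSDK()  # reads your token from the ANYSCALE_CLI_TOKEN environment variable

def resolve_build_id(cluster_env_name: str, version: Optional[int] = None) -> str:
    """Resolve a name[:version] pair (for example my-cluster-env:5) into a build_id."""
    # Look up the cluster environment by name.
    envs = sdk.search_cluster_environments(
        ClusterEnvironmentsQuery(name=TextQuery(equals=cluster_env_name))
    ).results
    if not envs:
        raise ValueError(f"No cluster environment named {cluster_env_name!r}")
    # List its builds; each build carries a numeric revision (the `:5` part).
    builds = sdk.list_cluster_environment_builds(
        cluster_environment_id=envs[0].id, count=50
    ).results
    if version is not None:
        builds = [b for b in builds if b.revision == version]
    if not builds:
        raise ValueError(f"No matching build for {cluster_env_name!r}")
    # With no explicit version, fall back to the latest build.
    return max(builds, key=lambda b: b.revision).id

build_id = resolve_build_id("my-cluster-env", version=5)
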
info

For large-scale Jobs, it's recommended to set the CPU resource on the head node to 0 in your Compute Config. This prevents Ray Actors and Tasks from being scheduled on the head node and protects your cluster from application resource leaks.
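
For example, a one-off compute config can zero out the head node's logical CPUs through its resources field. The following is a minimal sketch: the field names mirror the one-off compute config in the YAML example above, resources.cpu is assumed to map to the head node's logical Ray CPU count, and the worker node type is a placeholder.

# Sketch: a one-off compute config whose head node advertises 0 logical CPUs,
# so Ray schedules all Tasks and Actors on worker nodes.
compute_config = {
    "cloud_id": "cld_4F7k8814aZzGG8TNUGPKnc",  # placeholder from the YAML example above
    "region": "us-west-2",
    "head_node_type": {
        "name": "head",
        "instance_type": "m5.large",
        "resources": {"cpu": 0},  # keep application work off the head node
    },
    "worker_node_types": [
        {
            "name": "worker",
            "instance_type": "m5.4xlarge",  # placeholder worker type
            "min_workers": 1,
            "max_workers": 10,
        }
    ],
}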

note

The working_dir option of the runtime_env can be specified in two ways:

  • Remote URI: A zip file in a cloud storage bucket (AWS S3 or Google Cloud Storage) or directly accessible over HTTP/S (for example, a GitHub download URL). The zip file must contain only a single top-level directory. See Remote URIs for details.

  • Local directory: A local directory that will be uploaded to remote storage and downloaded by the cluster before running the Anyscale Job. The external storage location is specified using the upload_path field; for example, an Amazon S3 or Google Cloud Storage bucket. You are responsible for ensuring that your local environment and the future cluster have network access and IAM permissions for the remote location specified. External storage allows Anyscale to start a new cluster with your working_dir in the case of a Job failure.

Example: runtime_env = {"working_dir": "/User/code", "upload_path": "s3://my-bucket/subdir"}

Whether using the first or second option, the cluster running the Job must have permissions to download from the bucket or URL. For more on permissions, see accessing resources from cloud providers. The upload_path field is not available in OSS Ray. See Runtime Environments in Anyscale Jobs and Services for other differences in runtime_env behavior when using Anyscale Jobs and Services.

Submit and run

To submit your Job to Anyscale, use the CLI or the Python SDK:

anyscale job submit my_production_job.yaml --follow \
  --name my-production-job \
  --description "A production job running on Anyscale."
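
The same submission through the Python SDK looks roughly like this. A minimal sketch: create_job and CreateProductionJob are taken from the SDK reference, the build_id and compute_config_id values are placeholders (see the resolution sketch above), and the response shape is an assumption to verify against your SDK version.

from anyscale import AnyscaleSDK
from anyscale.sdk.anyscale_client.models import CreateProductionJob

sdk = AnyscaleSDK()  # reads your token from the ANYSCALE_CLI_TOKEN environment variable

job = sdk.create_job(
    CreateProductionJob(
        name="my-production-job",
        description="A production job running on Anyscale.",
        project_id="prj_7S7Os7XBvO6vdiVC1J0lgj",  # placeholder from the YAML example
        config={
            "entrypoint": "python my_job_script.py --option1=value1",
            "build_id": "bld_123",           # placeholder; resolve from cluster_env as shown above
            "compute_config_id": "cpt_456",  # placeholder
            "runtime_env": {"working_dir": "s3://my_bucket/my_job_files.zip"},
            "max_retries": 3,
        },
    )
)
print(job.result.id)  # the prodjob_... id used by the commands below
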
info

Anyscale also supports scheduling Jobs for recurring workloads. To schedule a Job, check out Anyscale Schedules.

Monitor

You can check the status of a Job on the Web UI, or query it using the CLI or the Python SDK:

anyscale job list --job-id [JOB_ID]
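
With the Python SDK, the equivalent is roughly the following. A minimal sketch: get_production_job and the state attributes are taken from the SDK reference; verify them against your SDK version.

from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

# Fetch the Job by its prodjob_... id (placeholder below) and read its current state.
job = sdk.get_production_job(production_job_id="prodjob_123").result
print(job.state.current_state)  # for example RUNNING, SUCCESS, or OUT_OF_RETRIES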

View logs

You can view the logs of a Job on the Web UI, or follow them using the CLI or the Python SDK:

anyscale job logs --job-id [JOB_ID] --follow

Terminate

You can terminate a Job using the CLI or the Python SDK:

anyscale job terminate --job-id [JOB_ID]
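
The Python SDK equivalent is roughly the following sketch; terminate_job is taken from the SDK reference, so treat the exact signature as an assumption.

from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

# Terminate the Job by its prodjob_... id (placeholder).
sdk.terminate_job(production_job_id="prodjob_123")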

Job Timeout

To set a timeout for your Anyscale Job, set maximum_uptime_minutes in the Compute Config. The timeout applies to each individual Job Attempt; when an attempt hits it, the Job retries if any retries remain.
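
For example, a one-off compute config that caps each Job Attempt at two hours might look like the following sketch. maximum_uptime_minutes is the Compute Config field described above; the other values are placeholders carried over from the YAML example.

# Sketch: a one-off compute config with a two-hour cap per Job Attempt.
compute_config = {
    "cloud_id": "cld_4F7k8814aZzGG8TNUGPKnc",  # placeholder from the YAML example
    "region": "us-west-2",
    "maximum_uptime_minutes": 120,  # each Job Attempt is terminated after 2 hours
    "head_node_type": {"name": "head", "instance_type": "m5.large"},
    "worker_node_types": [],
}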

Archive

Anyscale allows you to archive Jobs and hide them from the list view. Clusters created by an archived Job are archived automatically. Once a Job is archived, you can still view its details.

Archive a Job

  • A Job has to be inactive to be archived. The inactive states for Jobs are: Terminated, Out of Retries, Broken, and Success.
  • The user must have "write" permission for the Job to archive it.

You can archive Jobs on the Web UI, or through the CLI:

anyscale job archive --job-id [JOB_ID]

View archived Jobs in the list view

You can view archived Jobs by toggling on the "include archived" filter on the Web UI, or through the CLI:

anyscale job list --include-archived

Un-archive

The un-archive feature is currently under development; reach out to support if you would like to be an early adopter.