Create and manage Anyscale Jobs
Use of Anyscale Jobs requires Ray 2.0+.
Anyscale Jobs are discrete workloads managed by Anyscale. A developer designs and packages code for a Job, then submits the Job to Anyscale for execution and cluster lifecycle management. Jobs are best suited for workflows where you want Anyscale to handle starting the cluster and reacting to failures.
Lifecycle of an Anyscale Job:
- Creates a cluster
- Runs the Ray Job on it
- Restarts the Ray Job on failure (up to `max_retries`)
- Records the output and sends an email on success or failure
Define Anyscale Jobs
Jobs need to be defined in a YAML file like this:
name: "my-first-job"
project_id: "prj_7S7Os7XBvO6vdiVC1J0lgj"
compute_config: my-compute-config # You may specify `compute_config_id` or `cloud` instead
# Alternatively, a one-off compute config:
# compute_config:
#   cloud_id: cld_4F7k8814aZzGG8TNUGPKnc
#   region: us-west-2
#   head_node_type:
#     name: head
#     instance_type: m5.large
#   worker_node_types: []
cluster_env: my-cluster-env:5 # You may specify `build_id` instead
runtime_env:
  working_dir: "s3://my_bucket/my_job_files.zip"
  # You may also specify other runtime environment properties like `pip` and `env_vars`
  # pip: "./requirements.txt" # relative path to the local directory where `anyscale job submit` is run
  # pip:
  #   - pandas
  #   - torch
  # env_vars:
  #   SECRET: "xyz"
entrypoint: "python my_job_script.py --option1=value1"
max_retries: 3
Here are some of the fields to provide in the YAML file:

- `compute_config`: A Compute Config for the cluster the Job will run on.
  - On the SDK, this can be specified as `compute_config_id` or `compute_config` (a one-off compute config). This is required, and only one of the two fields can be specified.
  - On the CLI, you may specify `compute_config` (the name of the cluster compute config, or a one-off compute config) or `cloud` (the name of an Anyscale cloud) instead for convenience. Both attributes are optional; if neither is specified, the service uses a default compute config associated with the default cloud.
- (Required) `cluster_env`: A Cluster Environment for the cluster the Job will run on.
  - On the SDK, this can only be specified as `build_id`.
  - On the CLI, you may specify `cluster_env` (the name and version of the cluster environment, colon-separated; if you don't specify a version, the latest is used). This attribute is optional in the CLI. The SDK sketch after this list shows one way to resolve a missing value into a `build_id`.
- (Optional) `runtime_env`: A runtime environment containing your application code and dependencies.
- (Required) `entrypoint`: The entrypoint command that runs on the cluster. Although it's generally a Python script, the entrypoint can be any shell command.
- (Optional) `max_retries`: The maximum number of retries before the Job is considered failed (defaults to 5).
- (Optional) `name`: The name of the Job. Job names don't have to be unique, either within a project or across projects.
- (Optional) `project_id`: The ID of the Project the Job runs in. You can find the `project_id` in the URL by navigating to the project in the Anyscale Console. If not specified, the Job doesn't belong to any Project.
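For example, here's a minimal sketch of resolving a missing cluster environment into a `build_id` by asking the SDK for a default build. The `"py38"` and `"2.0.0"` arguments are illustrative assumptions; check which Python and Ray versions are available to your account:

```python
from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

# Resolve a default cluster environment build for a given Python and
# Ray version into a build_id usable in the job config.
# ("py38" / "2.0.0" are illustrative values.)
build = sdk.get_default_cluster_environment_build("py38", "2.0.0")
build_id = build.result.id
```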
For large-scale Jobs, it's recommended to set the CPU resource on the head node to 0 in your Compute Config. This prevents Ray Actors and Tasks from being scheduled on the head node and protects your cluster from application resource leaks.
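For illustration, here's a sketch of a one-off compute config that advertises 0 CPUs on the head node, using the SDK models shown in the submit example below. The `Resources` model and its `cpu` field are assumptions about the generated client; verify the names against your installed SDK:

```python
from anyscale.sdk.anyscale_client.models import (
    ClusterComputeConfig,
    ComputeNodeType,
    Resources,  # assumed model name; check anyscale.sdk.anyscale_client.models
)

compute_config = ClusterComputeConfig(
    cloud_id="cld_4F7k8814aZzGG8TNUGPKnc",  # example ID from the YAML above
    region="us-west-2",
    head_node_type=ComputeNodeType(
        name="head",
        instance_type="m5.large",
        # Advertise 0 CPUs so Ray won't schedule Actors or Tasks here.
        resources=Resources(cpu=0),
    ),
    worker_node_types=[],
)
```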
The `working_dir` option of the `runtime_env` can be specified in two ways:
- Remote URI: A zip file in a cloud storage bucket (AWS S3 or Google Cloud Storage) or directly accessible over HTTP/S (for example, a GitHub download URL). The zip file must contain only a single top-level directory. See Remote URIs for details.
- Local directory: A local directory that is uploaded to remote storage and downloaded by the cluster before running the Anyscale Job. The external storage location is specified using the `upload_path` field; for example, the `upload_path` could be an Amazon S3 or Google Cloud Storage bucket. You are responsible for ensuring that your local environment and the future cluster have network access and IAM permissions to the remote location specified. External storage allows Anyscale to start a new cluster with your `working_dir` in the case of a Job failure.
Example:
runtime_env = {"working_dir": "/User/code", "upload_path": "s3://my-bucket/subdir"}
Whether using the first or second option, the cluster running the job must have permissions to download from the bucket or URL. For more on permissions, see
accessing resources from cloud providers.
The `upload_path` field is not available in OSS Ray. See Runtime Environments in Anyscale Jobs and Services for more differences in `runtime_env` behavior when using Anyscale Jobs and Services.
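To make the two `working_dir` forms concrete, here they are as Python `runtime_env` dicts (bucket names and paths are placeholders from the examples above):

```python
# Option 1: remote URI -- a zip file with a single top-level directory,
# already stored in a bucket or URL the cluster can read.
runtime_env = {"working_dir": "s3://my_bucket/my_job_files.zip"}

# Option 2: local directory -- uploaded to `upload_path` at submit time,
# then downloaded by the cluster. `upload_path` is Anyscale-specific.
runtime_env = {
    "working_dir": "/User/code",
    "upload_path": "s3://my-bucket/subdir",
}
```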
Submit and run
To submit your Job to Anyscale, use the CLI or the Python SDK:
- CLI
- Python SDK
anyscale job submit my_production_job.yaml --follow \
  --name my-production-job \
  --description "A production job running on Anyscale."
from anyscale.sdk.anyscale_client.models import *
from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

job_config = {
    # IDs can be found on Anyscale Console under Configurations.
    # The IDs below are examples and should be replaced with your own IDs.
    'compute_config_id': 'cpt_U8RCfD7Wr1vCD4iqGi4cBbj1',
    # The compute config can also be specified as a one-off instead:
    # 'compute_config': ClusterComputeConfig(
    #     cloud_id="cld_V1U8Jk3ZgEQQbc7zkeBq24iX",
    #     region="us-west-2",
    #     head_node_type=ComputeNodeType(
    #         name="head",
    #         instance_type="m5.large",
    #     ),
    #     worker_node_types=[],
    # ),
    # The ID of the cluster environment build
    'build_id': 'bld_1277XIinoJmiM8Z3gNdcHN',
    'runtime_env': {
        'working_dir': 's3://my_bucket/my_job_files.zip'
    },
    'entrypoint': 'python my_job_script.py --option1=value1',
    'max_retries': 3
}

job = sdk.create_job(CreateProductionJob(
    name="my-production-job",
    description="A production job running on Anyscale.",
    # project_id can be found in the URL by navigating to the project in Anyscale Console
    project_id='prj_7S7Os7XBvO6vdiVC1J0lgj',
    config=job_config
))
Anyscale also supports scheduling Jobs for recurring workloads. To schedule a Job, check out Anyscale Schedules.
Monitor
You can check the status of a Job on the Web UI, or query it using the CLI or the Python SDK:
- CLI
- Python SDK
anyscale job list --job-id [JOB_ID]
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
job = sdk.get_production_job(JOB_ID)
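For example, here's a minimal sketch that polls a Job until it reaches a terminal state. The `state.current_state` attribute path and the exact state strings are assumptions (they mirror the inactive states listed in the Archive section below); verify them against your SDK version:

```python
import time

from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

# Assumed terminal state names, mirroring the Archive section below.
TERMINAL_STATES = {"SUCCESS", "TERMINATED", "OUT_OF_RETRIES", "BROKEN"}

while True:
    job = sdk.get_production_job(JOB_ID).result
    state = job.state.current_state  # assumed attribute path
    if state in TERMINAL_STATES:
        print(f"Job {job.id} finished with state {state}")
        break
    time.sleep(30)
```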
View logs
You can view logs of a Job on the Web UI, or follow it using the CLI or the Python SDK:
- CLI
- Python SDK
anyscale job logs --job-id [JOB_ID] --follow
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
job_logs = sdk.get_production_job_logs(JOB_ID)
Terminate
You can terminate a Job using the CLI or the Python SDK:
- CLI
- Python SDK
anyscale job terminate --job-id [JOB_ID]
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
sdk.terminate_job(JOB_ID)
Job Timeout
To set a timeout for your Anyscale Job, set `maximum_uptime_minutes` in the Compute Config. The `maximum_uptime_minutes` is applied to each individual Job attempt: when an attempt hits the limit, the Job retries if there are retries remaining.
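As a sketch, a one-off compute config with a two-hour per-attempt timeout might look like this via the SDK; passing `maximum_uptime_minutes` through `ClusterComputeConfig` is an assumption to verify against your SDK version:

```python
from anyscale.sdk.anyscale_client.models import ClusterComputeConfig, ComputeNodeType

compute_config = ClusterComputeConfig(
    cloud_id="cld_4F7k8814aZzGG8TNUGPKnc",  # example ID from the YAML above
    region="us-west-2",
    head_node_type=ComputeNodeType(name="head", instance_type="m5.large"),
    worker_node_types=[],
    # Each Job attempt is terminated after 2 hours; if retries remain,
    # the Job is retried on a fresh cluster.
    maximum_uptime_minutes=120,
)
```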
Archive
Anyscale allows you to archive Jobs and hide them from the list view. Clusters created by an archived Job are archived automatically. Once a Job is archived, you can still view its details.
Archive a Job
- The Job has to be inactive to be archived. Inactive states for Jobs are: `Terminated`, `Out of Retries`, `Broken`, and `Success`.
- The user must have "write" permission for the Job to archive it.
You can archive Jobs on the Web UI, or through the CLI:
- CLI
anyscale job archive --job-id [JOB_ID]
To view archived Jobs in the list view
You can view archived Jobs by toggling on the "include archived" filter on the Web UI, or through the CLI:
- CLI
anyscale job list --include-archived
Un-archive
The un-archive feature is currently under development; reach out to support if you would like to be an early adopter.