Create Anyscale Jobs
Use of Anyscale Jobs requires Ray 2.0+.
The typical lifecycle of an Anyscale Job looks like the following:
- Provisions a cluster (based on the provided compute configuration)
- Runs the specified entrypoint command (typically a Ray job)
- Restarts the job in case of failures (up to `max_retries` times)
- Records the output and sends notifications with the results of the job's run
Defining an Anyscale Job
Jobs can be submitted using either the CLI or the SDK. To submit a job using the CLI, specify the job configuration in a YAML config file like the following:
# (Required) User-provided identifier for the job
name: my-first-job
# (Required) The job's entrypoint
entrypoint: "python my_job_script.py --some-config=value"
# (Optional) Anyscale Project the job will be associated with.
# The project can be identified by its user-provided name or its internal project ID (of the form `prj_...`)
project_id: my-project
# (Optional) Compute Config specifying the configuration of the cluster (node types, min/max number of nodes, etc.) the job will run on.
# The compute config can be identified by its user-provided name or its internal compute config ID (of the form `cpt_...`)
compute_config: my-compute-config
# Alternatively, a one-off compute config can be specified inline like the following:
# compute_config:
#   cloud_id: cld_4F7k8814aZzGG8TNUGPKnc
#   region: us-west-2
#   head_node_type:
#     name: head
#     instance_type: m5.large
#   worker_node_types: []
# (Optional) Cluster Environment specifying a (Docker-like) container image that will be used to execute the job.
# The cluster environment can be identified by its user-provided name and version or its internal build ID (of the form `bld_...`)
cluster_env: my-cluster-env:5
# (Optional) Ray's Runtime Environment configuration, specified as-is under the `runtime_env` key:
# https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html
runtime_env:
  working_dir: "s3://my_bucket/my_job_files.zip"
# (Optional) Maximum number of retries for the job
max_retries: 3
Here are the fields you can provide in the YAML file:
- (Required) `name`: Name of the Job. Job names don't have to be unique; the same name can be used within a project and across projects.
- (Required) `entrypoint`: The entrypoint command to run on the cluster to start the job (the entrypoint can be any shell command).
- (Optional) `project_id`: The ID of the Project you want the Job to run in. The `project_id` can be found in the URL by navigating to the project in the Anyscale Console. If not specified, the Job will not belong to any Project.
- (Optional) `compute_config`: A Compute Config for the cluster the Job will be executed on.
  - Using the SDK, this can be specified as either `compute_config_id` or `compute_config` (a one-off compute config). This is required, and only one of these fields can be specified.
  - Using the CLI, you may specify `compute_config` as a) the name of the compute config, b) Anyscale's internal compute config ID, or c) an inline definition. If not provided, the default compute config is used.
- (Optional) `cluster_env`: A Cluster Environment for the cluster the Job will run on.
  - Using the SDK, this needs to be specified as `build_id`.
  - Using the CLI, you may specify `cluster_env` as a) the name and version of the cluster environment (colon-separated; if you don't specify a version, the latest is used) or b) Anyscale's internal cluster environment ID.
  - Note that this attribute is optional in the CLI but currently must be specified when using the SDK. The sketch after this list shows one way to resolve a missing value or a `cluster_env` into a `build_id`.
- (Optional) `runtime_env`: Ray's runtime environment configuration.
- (Optional) `max_retries`: Number of retries in case of failures encountered during Job execution (defaults to 5).
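As referenced in the `cluster_env` note above, here is a minimal sketch of falling back to a default build when no `cluster_env` is provided. It assumes the legacy SDK helper `get_default_cluster_environment_build` and its return shape; check the SDK reference for the exact signature.

```python
from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()

# Assumed helper: fetch the default cluster environment build for a given
# Ray and Python version, then use its ID wherever `build_id` is required.
build = sdk.get_default_cluster_environment_build(
    ray_version="2.0.0",
    python_version="py38",
).result

build_id = build.id  # of the form `bld_...`
```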
Note that for large-scale, compute-intensive Jobs, it's recommended to avoid scheduling Ray tasks onto the Ray head node, where they could interfere with Ray's control plane. To do that, set the CPU resource of the head node to 0 in your Compute Config, which prevents Ray's actors and tasks from being scheduled there.
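For illustration, here is a sketch of a one-off compute config with the head node's CPU resource set to 0, mirroring the commented-out inline compute config in the SDK example under Submit and run; the `Resources` model and the `resources` field on `ComputeNodeType` are assumptions here.

```python
from anyscale.sdk.anyscale_client.models import (
    ClusterComputeConfig,
    ComputeNodeType,
    Resources,
)

# The head node advertises zero CPUs, so Ray schedules no tasks or actors on it.
compute_config = ClusterComputeConfig(
    cloud_id="cld_4F7k8814aZzGG8TNUGPKnc",  # example cloud ID
    region="us-west-2",
    head_node_type=ComputeNodeType(
        name="head",
        instance_type="m5.large",
        resources=Resources(cpu=0),  # assumed field: keeps the head node task-free
    ),
    worker_node_types=[],
)
```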
The `working_dir` option of the `runtime_env` can be specified in two ways:
- Remote URI: A zip file in a cloud storage bucket (AWS S3 or Google Cloud Storage) or directly accessible over HTTP/S (for example, a GitHub download URL). The zip file must contain only a single top-level directory. See Remote URIs for details.
- Local directory: A local directory that will be uploaded to remote storage and downloaded by the cluster before running the Anyscale Job. The external storage location is specified using the `upload_path` field. For example, the `upload_path` could be an Amazon S3 or Google Cloud Storage bucket. You are responsible for ensuring that your local environment and the future cluster have network access and IAM permissions to the remote location specified. External storage allows Anyscale to start a new cluster with your `working_dir` in the case of a Job failure.
Example:
runtime_env = {"working_dir": "/User/code", "upload_path": "s3://my-bucket/subdir"}
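If you use the remote URI option, remember that the zip must contain exactly one top-level directory. A minimal sketch of producing such an archive with Python's standard library (the directory and file names are placeholders):

```python
import shutil

# Produces my_job_files.zip whose only top-level entry is "my_job_files/",
# satisfying the single-top-level-directory requirement for remote URIs.
shutil.make_archive(
    base_name="my_job_files",  # output file: my_job_files.zip
    format="zip",
    root_dir=".",              # parent of the directory being archived
    base_dir="my_job_files",   # the single top-level directory to include
)
```

The resulting archive can then be uploaded to, for example, `s3://my_bucket/my_job_files.zip` and referenced as the `working_dir`.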
Whether using the first or second option, the cluster running the job must have permissions to download from the bucket or URL. For more on permissions, see
accessing resources from cloud providers.
The `upload_path` field is not available in OSS Ray. See Runtime Environments in Anyscale Jobs and Services for more differences with `runtime_env` when using Anyscale Jobs and Services.
Submit and run
To submit your Job to Anyscale, use the CLI or the Python SDK:
- CLI
- Python SDK
anyscale job submit my_production_job.yaml --follow \
--name my-production-job \
--description "A production job running on Anyscale."
from anyscale import AnyscaleSDK
from anyscale.sdk.anyscale_client.models import *
sdk = AnyscaleSDK()
job_config = {
# IDs can be found in the Anyscale Console under Configurations.
# The IDs below are examples and should be replaced with your own IDs.
'compute_config_id': 'cpt_...',
# The compute config can also be specified as a one-off instead:
# 'compute_config': ClusterComputeConfig(
# cloud_id="cld_V1U8Jk3ZgEQQbc7zkeBq24iX",
# region="us-west-2",
# head_node_type=ComputeNodeType(
# name="head",
# instance_type="m5.large",
# ),
# worker_node_types=[],
# ),
# The id of the cluster env build
'build_id': 'bld_...',
'runtime_env': {
'working_dir': 's3://my_bucket/my_job_files.zip'
},
'entrypoint': 'python my_job_script.py --option1=value1',
'max_retries': 3
}
job = sdk.create_job(CreateProductionJob(
name="my-production-job",
description="A production job running on Anyscale.",
# project_id can be found in the URL by navigating to the project in the Anyscale Console
project_id='prj_...',
config=job_config
))
Anyscale also supports scheduling Jobs for recurring workloads. To schedule a Job, check out Anyscale Schedules.
Monitor Job Status
You can check the status of the Job on the Anyscale Platform's Jobs page or query it using the CLI/SDK:
- CLI
- Python SDK
anyscale job list --job-id 'prodjob_...'
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
job_id = "prodjob_..."
job = sdk.get_production_job(job_id)
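If you need to block until the Job finishes, a minimal polling sketch is shown below. The doc lists Terminated, Out of Retries, and Success as terminal states; the exact state strings and the `result.state.current_state` attribute path are assumptions here.

```python
import time

from anyscale import AnyscaleSDK

sdk = AnyscaleSDK()
job_id = "prodjob_..."

# Poll until the Job reaches a terminal state.
while True:
    state = str(sdk.get_production_job(job_id).result.state.current_state)
    print(f"Job state: {state}")
    if state in ("SUCCESS", "TERMINATED", "OUT_OF_RETRIES"):  # assumed strings
        break
    time.sleep(30)  # avoid hammering the API
```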
View Job logs
You can view the Job's logs on the Ray Dashboard or follow them using the CLI/SDK:
- CLI
- Python SDK
anyscale job logs --job-id 'prodjob_...' --follow
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
job_id = "prodjob_..."
job_logs = sdk.get_production_job_logs(job_id)
By default (for self-hosted deployments), Anyscale does NOT collect or persist any application logs.
This means that logs become unavailable once the Ray cluster is shut down, unless a third-party logging solution (like CloudWatch, Datadog, etc.) is set up.
Anyscale also provides an option to ingest logs into a secure logging solution, giving you easy access to application logs beyond the lifetime of the cluster right from the Anyscale Console. If you're interested, please reach out to have this feature enabled for your environment.
Terminate Job
You can terminate a Job from the Anyscale Console's Jobs page or using the CLI/SDK:
- CLI
- Python SDK
anyscale job terminate --job-id 'prodjob_...'
from anyscale import AnyscaleSDK
sdk = AnyscaleSDK()
job_id = "prodjob_..."
sdk.terminate_job(job_id)
Set Job Maximum Runtime
The cluster's `maximum_uptime_minutes` configuration, which you can specify in the Compute Config, applies directly to Anyscale Jobs: clusters running Anyscale Jobs are forcibly terminated after `maximum_uptime_minutes`, irrespective of the state of the job.
Upon hitting `maximum_uptime_minutes`, the job is automatically retried if there are retry attempts remaining (configured via `max_retries`).
This feature is particularly useful for reclaiming the resources of jobs that haven't finished within their allocated time budget.
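As a sketch, the limit can be set on the one-off compute config from the examples above; that `maximum_uptime_minutes` sits at the top level of the compute config is an assumption here, so check the Compute Config reference.

```python
from anyscale.sdk.anyscale_client.models import (
    ClusterComputeConfig,
    ComputeNodeType,
)

compute_config = ClusterComputeConfig(
    cloud_id="cld_4F7k8814aZzGG8TNUGPKnc",  # example cloud ID
    region="us-west-2",
    maximum_uptime_minutes=180,  # assumed placement: force-terminate after 3 hours
    head_node_type=ComputeNodeType(name="head", instance_type="m5.large"),
    worker_node_types=[],
)
```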
Archive Job
Anyscale allows you to archive Jobs to hide them from the list view. The clusters associated with an archived Job are archived automatically.
Once a Job is archived, it's hidden from the Jobs list page on the Anyscale Console, but you can still access its details via the CLI/SDK.
How to archive a Job
- To be archived, a Job has to be in a terminal state (`Terminated`, `Out of Retries`, or `Success`).
- The user must have "write" permission for the Job to archive it.
You can archive Jobs in Anyscale Console, or through the CLI/SDK:
- CLI
anyscale job archive --job-id [JOB_ID]
How to view archived Jobs
You can list archived Jobs by toggling on the "include archived" filter in the Anyscale Console, or using the CLI:
- CLI
anyscale job list --include-archived