Production Jobs

Use of Production jobs requires Ray 1.9+.

Production jobs are discrete workloads managed by Anyscale. A developer designs and packages code for a job, and then submits the job to Anyscale for execution and cluster lifecycle management. These jobs are best suited to workflows where you want Anyscale to handle starting the cluster and reacting to failures.

info

Production jobs, or Anyscale jobs, are related to but distinct from Ray Jobs. You can still use Ray Jobs on Anyscale.

Use Anyscale's Production jobs (e.g. using anyscale job submit or the Anyscale SDK) when:

  • You have code ready to package and run.
  • Your workflow is designed to use a whole cluster end-to-end.

Use Ray Jobs (e.g. using ray job submit or the Ray Jobs SDK) when you need to iterate, because:

  • You can run Ray jobs on an existing, running cluster.
  • You can iterate on a Ray job as you harden it for production deployment.

After a developer (or automated process) submits a job definition, Anyscale:

  1. Creates a cluster,
  2. Runs the job on it, and
  3. Monitors the job for completion or failure.

If the job fails, Anyscale restarts it, and gives up after a configurable number of retries (see max_retries below).
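The retry behavior can be pictured with a short sketch. It illustrates the semantics only and is not Anyscale's implementation: `run_with_retries` and `run_once` are hypothetical names, and the choice of one initial attempt plus up to `max_retries` further attempts is an assumption that this page does not confirm.

```python
from typing import Callable


def run_with_retries(run_once: Callable[[], bool], max_retries: int = 5) -> bool:
    """Run a job, restarting it on failure (illustrative only).

    Assumption: one initial attempt plus up to `max_retries` retries.
    `run_once` is a hypothetical callable that returns True on success.
    """
    for _ in range(1 + max_retries):
        if run_once():
            return True  # job completed, so stop retrying
    return False  # retries exhausted, so the job is considered failed


# Example: a job that fails twice and then succeeds on the third attempt.
outcomes = iter([False, False, True])
succeeded = run_with_retries(lambda: next(outcomes), max_retries=3)
```

The default of 5 mirrors the documented default for max_retries.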

Using production jobs

Define

When you submit a production job, you specify the following options:

  • (Required) compute_config A cluster compute config for the cluster the job will run on. On the SDK, this can only be specified as compute_config_id. On the CLI, you may specify compute_config (the name of the cluster compute config) or cloud (the name of an Anyscale cloud) instead for convenience. This attribute is optional in the CLI. The SDK example below shows how to resolve a missing value, or a compute_config or a cloud into a compute_config_id.
  • (Required) cluster_env A cluster environment for the cluster the job will run on. On the SDK, this can only be specified as build_id. On the CLI, you may specify cluster_env (the name and version for the cluster environment, colon-separated; if you don't specify a version, the latest will be used). This attribute is optional in the CLI. The SDK example below shows how to resolve a missing value, or a cluster_env into a build_id.
  • (Optional) runtime_env A runtime environment containing your application code and dependencies.
  • (Required) entrypoint The command that will be run on the cluster to run the job. Although it is generally a Python script, the entrypoint can be any shell command.
  • (Optional) max_retries A maximum number of retries before the job is considered failed (defaults to 5).
  • (Optional) name Name of the job. Job names need not be unique, either within a project or across projects.
  • (Optional) project_id The id of the project you want the job to run in. project_id can be found in the URL by navigating to the project in Anyscale Console.

note

The working_dir option of the runtime_env, typically containing your job script and other necessary files, can be specified in one of two ways.

The first option is to set working_dir to be a remote URI: either a zip file preexisting in an AWS S3 or Google Cloud Storage bucket, or a zip file accessible over HTTP/S (e.g., a GitHub download URL). The zip file must contain only a single top-level directory; see Remote URIs for details.

The second option is to set working_dir to be a local directory. If you do this, you are responsible for providing external storage: your runtime_env must include the field upload_path specifying an Amazon S3 or Google Cloud Storage bucket or subdirectory. Your local directory will then be uploaded to remote storage and downloaded by the cluster before running the job. External storage lets Anyscale start a new cluster with your working_dir if the job fails.

Example: runtime_env = {"working_dir": "/User/code", "upload_path": "s3://my-bucket/subdir"}

Whether using the first or second option, the cluster running the job must have permissions to download from the bucket or URL. For more on permissions, see our Cloud Access Overview, or Using a Local Directory with Anyscale Jobs and Services on GCP if using GCP.
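The two working_dir options can be told apart with a small local check before submitting. This is a minimal sketch, not part of the Anyscale CLI or SDK: `check_working_dir` and `REMOTE_SCHEMES` are hypothetical names, and the set of schemes treated as remote is an assumption based on the S3, Google Cloud Storage, and HTTP/S options described above.

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (an assumption, see lead-in).
REMOTE_SCHEMES = {"s3", "gs", "http", "https"}


def check_working_dir(runtime_env: dict) -> str:
    """Return "remote" or "local"; raise if a local dir lacks upload_path."""
    working_dir = runtime_env["working_dir"]
    if urlparse(working_dir).scheme in REMOTE_SCHEMES:
        return "remote"  # first option: a zip file at a remote URI
    if "upload_path" not in runtime_env:
        # second option: a local directory needs external storage
        raise ValueError("a local working_dir requires upload_path")
    return "local"


remote_env = {"working_dir": "s3://my_bucket/my_job_files.zip"}
local_env = {"working_dir": "/User/code", "upload_path": "s3://my-bucket/subdir"}
print(check_working_dir(remote_env))
print(check_working_dir(local_env))
```

The check only inspects the shape of the dictionary; it does not verify that the cluster has permission to read from the bucket or URL.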

These options are specified in a configuration file:

my_production_job.yaml
name: "my-first-job"
project_id: "prj_7S7Os7XBvO6vdiVC1J0lgj"
compute_config: my-compute-config # You may specify `compute_config_id` or `cloud` instead
cluster_env: my-cluster-env:5 # You may specify `build_id` instead
runtime_env:
  working_dir: "s3://my_bucket/my_job_files.zip"
  # You may also specify other runtime environment properties like `pip` and `env_vars`
entrypoint: "python my_job_script.py --option1=value1"
max_retries: 3

Submit

To submit your job to Anyscale, use the CLI or the Python SDK:

anyscale job submit my_production_job.yaml \
  --name my-production-job \
  --description "A production job running on Anyscale."

Monitor

You can check the status of a job on the Web UI, or query it using the CLI or the Python SDK:

anyscale job list --job-id [JOB_ID]
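When monitoring from a script, a common pattern is to poll until the job reaches a state of interest. The sketch below shows only the polling loop: `wait_for_state`, `get_state`, `poll_seconds`, and `max_polls` are hypothetical names, and how `get_state` obtains the state (from the CLI or the SDK) is deliberately left out. The state names PENDING, RUNNING, and TERMINATED are taken from the example output on this page; other states may exist.

```python
import time
from typing import Callable, Set


def wait_for_state(get_state: Callable[[], str], targets: Set[str],
                   poll_seconds: float = 10.0, max_polls: int = 60) -> str:
    """Poll `get_state` until it returns a state in `targets`.

    Returns the last state seen, which may not be a target if
    `max_polls` is reached first.
    """
    state = get_state()
    polls = 1
    while state not in targets and polls < max_polls:
        time.sleep(poll_seconds)
        state = get_state()
        polls += 1
    return state


# Example with a stand-in for a real status query.
states = iter(["PENDING", "RUNNING", "TERMINATED"])
final_state = wait_for_state(lambda: next(states), {"TERMINATED"}, poll_seconds=0)
```

Bounding the number of polls keeps a script from waiting forever if the job never reaches the expected state.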

Terminate

You can attempt to terminate a job using the CLI or the Python SDK:

anyscale job terminate --job-id [JOB_ID]

Example

You can try out production jobs using a public GitHub repo. The following hello_world.py example is located in Anyscale's docs_examples repo:

hello_world.py
import ray
import anyscale

@ray.remote
def say_hi(message):
    return f"Hello, {message}."

ray.init()
print(ray.get(say_hi.remote("World")))

You can define the following job configuration file:

hello_world.yaml
entrypoint: "python hello_world.py"
runtime_env:
  working_dir: "https://github.com/anyscale/docs_examples/archive/refs/heads/main.zip"

Run, monitor and stop the job as follows:

# Submit the job defined above using the CLI
anyscale job submit hello_world.yaml --name hello_world
No cluster environment provided, setting default based on local Python and Ray versions.
No cloud or cluster compute config specified, using the default: cpt_5e2TnnRv8m6hyPXqCg5CTEfX.
Job prodjob_e1xd11CfHKditQK6XJnx4zqZ has been successfully submitted. Current state of job: PENDING.
Query the status of the job with `anyscale job list --job-id prodjob_e1xd11CfHKditQK6XJnx4zqZ`.
View the job in the UI at https://console.anyscale-staging.com/jobs/prodjob_e1xd11CfHKditQK6XJnx4zqZ.

# Query the status of the job using the ID returned above
anyscale job list --job-id prodjob_e1xd11CfHKditQK6XJnx4zqZ
View your jobs in the UI at https://console.anyscale-staging.com/jobs
Jobs:
NAME         ID                                COST  PROJECT NAME  CLUSTER NAME                                             CURRENT STATE  CREATOR  ENTRYPOINT
hello_world  prodjob_e1xd11CfHKditQK6XJnx4zqZ  0     my-project    cluster_for_prodjob_e1xd11CfHKditQK6XJnx4zqZ_HII5B88lSl  RUNNING        user     python hello_world.py

# Terminate the job
anyscale job terminate --job-id prodjob_e1xd11CfHKditQK6XJnx4zqZ
Job prodjob_e1xd11CfHKditQK6XJnx4zqZ has begun terminating...
Current state of job: RUNNING. Goal state of job: TERMINATED
Query the status of the job with `anyscale job list --job-id prodjob_e1xd11CfHKditQK6XJnx4zqZ`.

# Query the status again
anyscale job list --job-id prodjob_e1xd11CfHKditQK6XJnx4zqZ
View your jobs in the UI at https://console.anyscale-staging.com/jobs
Jobs:
NAME         ID                                COST  PROJECT NAME  CLUSTER NAME  CURRENT STATE  CREATOR  ENTRYPOINT
hello_world  prodjob_e1xd11CfHKditQK6XJnx4zqZ  0     my-project                  TERMINATED     user     python hello_world.py
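If you script these steps, the job ID has to be carried from the submit output into the later list and terminate commands. The sketch below pulls it out with a regular expression. `extract_job_id` and `JOB_ID_PATTERN` are hypothetical names, and the pattern is inferred from the transcript above, so it may need adjusting if the CLI output format differs or changes.

```python
import re
from typing import Optional

# Pattern inferred from the example transcript; not a documented format.
JOB_ID_PATTERN = re.compile(r"\bJob (prodjob_\w+) has been successfully submitted")


def extract_job_id(cli_output: str) -> Optional[str]:
    """Return the job ID found in submit output, or None if absent."""
    match = JOB_ID_PATTERN.search(cli_output)
    return match.group(1) if match else None


sample_output = (
    "Job prodjob_e1xd11CfHKditQK6XJnx4zqZ has been successfully submitted. "
    "Current state of job: PENDING."
)
job_id = extract_job_id(sample_output)
print(job_id)
```

Returning None rather than raising lets the caller decide how to handle output that does not contain a job ID.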