
Develop and Deploy

This tutorial covers the end-to-end workflow of developing, testing, and deploying a Ray Serve application on Anyscale. In particular, it covers the following workflows:

  • The development workflow using Anyscale Workspaces.
  • The production workflow using Anyscale Services.

Throughout the tutorial, we will be using a HuggingFace natural language model to classify the sentiment of text. Even though we will only be serving a single model, you can adapt this tutorial to scale out the number of models being served.

Development workflow

Create a workspace

The development workflow is similar to that of other Ray libraries. Use Anyscale Workspaces to iterate on your Ray application. For the serving use case in particular, Workspaces provide a persistent and scalable development environment for easily testing machine learning models.

To start a workspace, you can either use the Web UI or the CLI.

On your laptop, make sure the anyscale CLI is installed:

pip install -U anyscale

Create a workspace using the anyscale workspace CLI command. To start, we need the following parameters:

  • project-id: you can obtain this by calling anyscale project list.

  • cloud-id: you can obtain this by calling anyscale cloud list.

  • compute-config-id: you can create one using the following file and command:

    cloud_id: cld_xyz # TODO: fill in your cloud id
    head_node_type:
      name: head_node
      instance_type: m5.2xlarge
    worker_node_types:
      - name: cpu_worker
        instance_type: m5.4xlarge
        min_workers: 0
        max_workers: 10
        use_spot: true
      - name: gpu_worker
        instance_type: g4dn.4xlarge
        min_workers: 0
        max_workers: 10
        use_spot: true
    $ anyscale cluster-compute create compute_config.yaml --name serve-tutorial-config

    Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

    (anyscale +0.6s) View this cluster compute at:
    (anyscale +0.6s) Cluster compute id: cpt_Hsmn2dtxAiZytWZ3iPtwTD1y
    (anyscale +0.6s) Cluster compute name: serve-tutorial-config
  • cluster-env-build-id: Anyscale uses the default build ID anyscaleray-ml270optimized-py310-gpu, which already provides most of the necessary environment. Learn more about crafting your own cluster environment here.

Now that we have all the parameters, let's create the workspace:

anyscale workspace create \
--name serve-tutorial \
--project-id "prj_a3cug4HTAiLFg2TiiY1K5ftZ" \
--cloud-id "cld_4F7k8814aZzGG8TNUGPKnc" \
--compute-config-id "cpt_Hsmn2dtxAiZytWZ3iPtwTD1y" \
--cluster-env-build-id "anyscaleray-ml270optimized-py310-gpu"

You can check the status of the workspace via the web console. The workspace should be ready in a few minutes.

You can access the workspace in several ways:

  • Jupyter Lab
  • VS Code Web
  • VS Code Desktop
  • SSH Terminal

Write your application

For Ray Serve applications, we recommend writing the program as Python scripts and modules instead of using Jupyter notebooks, so VS Code or SSH is preferred.

The development workflow of Ray Serve on Anyscale mirrors the one you would follow on your laptop with open source Ray Serve:

  • Run the application with serve run
  • Test the application with HTTP requests
  • Update your code and repeat

Now we will show how to run the sentiment analysis model in an Anyscale workspace:

Open the workspace through either the web browser version or your desktop app. You can start by selecting VS Code Desktop under Tools menu.

If you don't have local VS Code installed, you can use the hosted VS Code option.

Once VS Code opens, you should see an empty workspace. In the VS Code terminal, type:

# Initialize the workspace folder with the example repo
git clone .

You can iterate on the code repo with a standard git workflow. The files persist across workspace restarts as well.

Now let's open the sentiment_analysis folder and view the file. You can edit the file directly, with proper type hints and auto-completion built in.

Test your application

Now, let's run the Serve application.

Open the VS Code terminal:

cd sentiment_analysis && serve run app:model

The output will be similar to the following:

2022-12-15 11:33:25,404 INFO -- Deploying from import path: "app:model".
2022-12-15 11:34:39,841 INFO -- Connecting to existing Ray cluster at address:
2022-12-15 11:34:39,849 INFO -- Connected to Ray cluster. View the dashboard at
2022-12-15 11:34:39,852 INFO -- Pushing file package 'gcs://' (0.04MiB) to Ray cluster...
2022-12-15 11:34:39,852 INFO -- Successfully pushed file package 'gcs://'.
raylet) 2022-12-15 11:34:39,863 INFO -- Successfully created runtime env: {"working_dir": "gcs://"}, the context: {"command_prefix": ["cd", "/tmp/ray/session_2022-12-15_09-29-44_780598_165/runtime_resources/working_dir_files/_ray_pkg_094d425b0eb2726023050ff58001a46e", "&&"], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2022-12-15_09-29-44_780598_165/runtime_resources/working_dir_files/_ray_pkg_094d425b0eb2726023050ff58001a46e"}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null, "container": {}, "java_jars": []}
(ServeController pid=25363) INFO 2022-12-15 11:34:41,988 controller 25363 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-0f19344b4199e1b8b3c4db4638afc1bd47e072dbe5dba896fac7c5e3' on node '0f19344b4199e1b8b3c4db4638afc1bd47e072dbe5dba896fac7c5e3' listening on ''
(HTTPProxyActor pid=25417) INFO: Started server process [25417]
(ServeController pid=25363) INFO 2022-12-15 11:34:43,933 controller 25363 - Adding 1 replica to deployment 'LanguageModel'.
(ServeReplica:LanguageModel pid=25460) No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (
Downloading: 0%| | 0.00/255M [00:00<?, ?B/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 799kB/s]
2022-12-15 11:35:03,897 SUCC -- Deployed successfully.

This is a blocking command; it runs in the terminal until you interrupt it with Ctrl+C.

You can now query it using any HTTP client. For this example, we demonstrate it using the Python requests module. Open a new terminal using VS Code or SSH, start an interpreter with python, and run:

import requests

resp = requests.get(
    "http://localhost:8000/predict",
    params={"text": "Anyscale workspaces are great!"},
)
print(resp.json())
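If you prefer to avoid third-party dependencies, the same kind of query can be issued with only the Python standard library. The snippet below is a self-contained sketch: it spins up a stand-in local handler in place of a live Serve deployment, and the response schema (`{"label": ..., "text": ...}`) is an assumption for illustration, not the tutorial app's actual output.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakePredictHandler(BaseHTTPRequestHandler):
    """Stand-in for the Serve endpoint so this sketch runs on its own."""

    def do_GET(self):
        # Parse the ?text=... query parameter from the request path.
        query = urllib.parse.urlparse(self.path).query
        text = urllib.parse.parse_qs(query).get("text", [""])[0]
        body = json.dumps({"label": "POSITIVE", "text": text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def query_sentiment(base_url, text):
    # GET <base_url>/predict?text=... and decode the JSON response.
    url = base_url + "/predict?" + urllib.parse.urlencode({"text": text})
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


# Bind to an ephemeral port and serve the stand-in handler in the background.
server = HTTPServer(("127.0.0.1", 0), FakePredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

result = query_sentiment(f"http://127.0.0.1:{server.server_port}",
                         "Anyscale workspaces are great!")
server.shutdown()
print(result)
```

Against a real deployment, you would point `query_sentiment` at http://localhost:8000 instead of the stand-in server.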

(Optional) Make the endpoint accessible from your local machine's browser

Follow these steps on your local machine to expose the port on your localhost. Open a terminal and execute the following commands:

# Clone the workspace
anyscale workspace clone --name serve-tutorial
# Navigate to the workspace directory
cd serve-tutorial
# Set up SSH local port forwarding to expose port 8000
anyscale workspace ssh -- -L 8000:localhost:8000

Once you've completed the above steps for your chosen development environment, you can visit the endpoint directly in your browser. For instance, http://localhost:8000/predict?text=Anyscale%20are%20great!.
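When constructing such URLs yourself, remember that query parameters must be percent-encoded; the Python standard library can build the URL for you (host and route taken from the tutorial's local endpoint):

```python
import urllib.parse

# Build the query URL with proper percent-encoding of spaces and punctuation.
base = "http://localhost:8000/predict"
params = {"text": "Anyscale workspaces are great!"}
url = base + "?" + urllib.parse.urlencode(params)
print(url)
```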

(Optional) Edit your application

Now that you have a working application, you can edit it for your own use case. Here are a few suggestions:

  • Use Anyscale Workspaces to leverage multiple nodes and GPU nodes. Try changing the @serve.deployment(route_prefix="/") to the following to see what happens!
    • Add a GPU using @serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
    • Run multiple replicas using @serve.deployment(route_prefix="/", num_replicas=10)
  • Try a more heavyweight HuggingFace pipeline. Anyscale Workspaces give you access to powerful instances in the cloud with faster network connections to cloud object storage, so there's no more waiting for the model to download!
    • Change the model to a heavy model like text generation by setting model = LanguageModel.bind(task="text-generation")
    • Add support for configuring the model by passing in init parameters, similar to task.

The takeaway is that these tasks are not easy to do on a local laptop. An Anyscale workspace is the natural place to develop your serving application, with access to elastic resources. For the complete list of configurable options for your Serve deployment, refer to the Ray Serve deployment docs.


Pausing and resuming a workspace

Once you terminate the Python command, Anyscale automatically shuts down idle clusters after a timeout (2 hours by default).

When you come back to work on the same project, you can just resume the workspace by starting it in the UI or running anyscale workspace start in the project directory.

Moving to production

After completing the development of your application, you can deploy it using Anyscale production services, which offer the benefits of running your Ray Serve workload with high availability and fault tolerance. You can learn more here.

Set up your service

To use Anyscale services, you can use the CLI or the Python SDK. These can be executed from your personal laptop or within a development workspace.

First, create and update the YAML configuration file for the service.

name: "sentiment-service"
cluster_env: default_cluster_env_ml_2.9.0_py310
applications:
  - name: sentiment_analysis
    import_path: ""
    working_dir: ""

For more information regarding the YAML schema, check out the API reference.

Next, you can deploy the service using the following CLI command. You should see relevant links in the output.

$ anyscale service rollout -f config.yaml
Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

(anyscale +4.5s) Service service2_v96jsyvutffntejfivh3inczcd has been deployed. Service is transitioning towards: RUNNING.
(anyscale +4.5s) View the service in the UI at

You can then visit the service link to:

  • Observe the current status of the service
  • View service logs and metrics
  • View the OpenAPI documentation page
  • Query the API

After cloning the repository, you can set working_dir to "." if the service YAML definition file is in the same folder as the Python files you are deploying.

- working_dir:
+ working_dir: "."
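Assuming the import path used earlier in the tutorial (app:model) and the structure of the configuration file above, the application entry would then look something like this (a sketch; verify against your actual schema):

```yaml
- name: sentiment_analysis
  import_path: app:model
  working_dir: "."
```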

If you inspect the service configuration on the created service page, you can see the runtime environment automatically populated for you.

working_dir: >-

You can also upload to a bucket of your choice by adding an upload_path line:

- working_dir:
+ working_dir: .
+ upload_path: "s3://your-bucket/path"

Query your service

To query your service, navigate to the service page and select the 'Query' button located at the top right corner. This will display both the curl and Python commands. Moreover, the Python SDK provides a programmatic way to retrieve the token and URL needed to query the service.

Anyscale services are exposed over the HTTP protocol. Thus, you can use any HTTP client as long as you add the Bearer Token in the headers.
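For example, with only the Python standard library (the service URL and token below are placeholders; substitute the real values shown on your service's Query page):

```python
import urllib.request

# Placeholder values: replace with the URL and bearer token from your
# service's Query page.
service_url = "https://sentiment-service.example.anyscaleuserdata.com/predict"
token = "YOUR_TOKEN"

# Attach the Bearer token in the Authorization header; any HTTP client
# that supports custom headers works the same way.
req = urllib.request.Request(
    service_url + "?text=Anyscale+services+are+great%21",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib stores the header under its capitalized name.
print(req.get_header("Authorization"))
```

Passing `req` to `urllib.request.urlopen` would then issue the authenticated request against the live service.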


To shut down the service, you can use the Anyscale Web Console, the CLI, or the Python SDK.

Find your Anyscale Service in the Web Console and click the Terminate button.