
Develop and Deploy a Ray Serve Application

This tutorial covers the end-to-end workflow of developing, testing, and deploying a Ray Serve application on Anyscale. In particular, it covers the following workflows:

  • The development workflow using Anyscale Workspaces.
  • The production workflow using Anyscale Services.

Throughout the tutorial, we will use a Hugging Face natural language model to classify the sentiment of text. Even though we will only serve a single model, you can adapt this tutorial to scale out the number of models being served.

Development workflow

Create a workspace

The development workflow is similar to that of any other Ray library. We recommend using Anyscale Workspaces to iterate on your Ray application. For serving use cases in particular, workspaces provide a persistent and scalable development environment, making it easy to test your machine learning models.

To start a workspace, you can either use the Web UI or the CLI.

On your laptop, make sure the anyscale CLI is installed:

pip install -U anyscale

Create a workspace using the anyscale workspace CLI command. To start, we need the following parameters:

  • project-id: you can obtain this by calling anyscale project list.

  • cloud-id: you can obtain this by calling anyscale cloud list.

  • compute-config-id: you can create one using the following file and command:

    cloud_id: cld_xyz # TODO: fill in your cloud id
    head_node_type:
      name: head_node
      instance_type: m5.2xlarge
    worker_node_types:
      - name: cpu_worker
        instance_type: m5.4xlarge
        min_workers: 0
        max_workers: 10
        use_spot: true
      - name: gpu_worker
        instance_type: g4dn.4xlarge
        min_workers: 0
        max_workers: 10
        use_spot: true
    $ anyscale cluster-compute create compute_config.yaml --name serve-tutorial-config

    Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

    (anyscale +0.6s) View this cluster compute at:
    (anyscale +0.6s) Cluster compute id: cpt_Hsmn2dtxAiZytWZ3iPtwTD1y
    (anyscale +0.6s) Cluster compute name: serve-tutorial-config
  • cluster-env-build-id: We will use the default build id anyscaleray-ml210-py37-gpu, which already provides most of the necessary environment. You can learn more about crafting your own cluster environment here.

Now that we have all the parameters, let's create the workspace:

anyscale workspace create \
--name serve-tutorial \
--project-id "prj_a3cug4HTAiLFg2TiiY1K5ftZ" \
--cloud-id "cld_4F7k8814aZzGG8TNUGPKnc" \
--compute-config-id "cpt_Hsmn2dtxAiZytWZ3iPtwTD1y" \
--cluster-env-build-id "anyscaleray-ml210-py37-gpu"

You can check the status of the workspace via the web console. The workspace should be ready in a few minutes.

You can access the workspace in several ways:

  • Jupyter Lab
  • VS Code Web
  • VS Code Desktop
  • SSH Terminal

Write your application

For Ray Serve applications, we recommend writing the program as Python scripts and modules instead of using Jupyter notebooks. Therefore, VS Code or SSH is preferred.

The development workflow of Ray Serve on Anyscale mirrors the one you would use on your laptop with open source Ray Serve:

  • Use serve run to iterate on your application
  • Test the application with HTTP requests
  • Update your code and repeat

Now we will show how to run the sentiment analysis model on Anyscale workspaces:

Open the workspace through either the web browser version or your desktop app. You can start by selecting VS Code Desktop under the Tools menu.

If you don't have local VS Code installed, you can use the hosted VS Code option.

Once VS Code opens, you should see an empty workspace at the start. In the VS Code terminal, type:

# Initialize the workspace folder with the example repo
git clone .

You can iterate on the code repo with standard git workflow. The files are persisted across workspace restarts as well.

Now let's open the sentiment_analysis folder and view the file. You can edit the file directly, with proper type hints and auto-completion built in.

Test your application

Now, let's run the Serve application.

Open the VS Code terminal:
serve run app:model

The output will be similar to the following:

2022-12-15 11:33:25,404 INFO -- Deploying from import path: "app:model".
2022-12-15 11:34:39,841 INFO -- Connecting to existing Ray cluster at address:
2022-12-15 11:34:39,849 INFO -- Connected to Ray cluster. View the dashboard at
2022-12-15 11:34:39,852 INFO -- Pushing file package 'gcs://' (0.04MiB) to Ray cluster...
2022-12-15 11:34:39,852 INFO -- Successfully pushed file package 'gcs://'.
(raylet) 2022-12-15 11:34:39,863 INFO -- Successfully created runtime env: {"working_dir": "gcs://"}, the context: {"command_prefix": ["cd", "/tmp/ray/session_2022-12-15_09-29-44_780598_165/runtime_resources/working_dir_files/_ray_pkg_094d425b0eb2726023050ff58001a46e", "&&"], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2022-12-15_09-29-44_780598_165/runtime_resources/working_dir_files/_ray_pkg_094d425b0eb2726023050ff58001a46e"}, "py_executable": "/home/ray/anaconda3/bin/python", "resources_dir": null, "container": {}, "java_jars": []}
(ServeController pid=25363) INFO 2022-12-15 11:34:41,988 controller 25363 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-0f19344b4199e1b8b3c4db4638afc1bd47e072dbe5dba896fac7c5e3' on node '0f19344b4199e1b8b3c4db4638afc1bd47e072dbe5dba896fac7c5e3' listening on ''
(HTTPProxyActor pid=25417) INFO: Started server process [25417]
(ServeController pid=25363) INFO 2022-12-15 11:34:43,933 controller 25363 - Adding 1 replica to deployment 'LanguageModel'.
(ServeReplica:LanguageModel pid=25460) No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (
Downloading: 0%| | 0.00/255M [00:00<?, ?B/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 799kB/s]
2022-12-15 11:35:03,897 SUCC -- Deployed successfully.

This is a blocking command; it will run in the terminal until you interrupt it with Ctrl+C.

You can now query it using any HTTP client. For this example, we will use the Python requests module. Open a new terminal using VS Code or SSH and run the following in a Python session:

import requests

resp = requests.get(
    "http://localhost:8000/predict",
    params={"text": "Anyscale workspaces are great!"},
)
print(resp.json())

(Optional) Edit your application

Now you have a working application. You can try editing it for your own use case. Here are a few suggestions:

  • As compared to the local laptop development experience, an Anyscale workspace lets you use GPU nodes as well as multiple nodes. Try changing the @serve.deployment(route_prefix="/") decorator to the following to see what happens!
    • Add a GPU using @serve.deployment(route_prefix="/", ray_actor_options={"num_gpus": 1})
    • Run multiple replicas using @serve.deployment(route_prefix="/", num_replicas=10)
  • Try a different Hugging Face pipeline that's more heavyweight! Anyscale workspaces give you access to powerful VMs in the cloud, which also have fast network connections to cloud object storage. No more waiting for the model to download to your laptop or dev box!
    • Change the model to a heavy model like text generation by setting model = LanguageModel.bind(task="text-generation")
    • Add support for configuring the model by passing in init parameters, similar to task.

The takeaway is that none of this comes easily on a local laptop. An Anyscale workspace is the natural place to develop your serving application, with access to elastic resources.


Pausing and continuing workspace

As long as you have terminated the serve run command, Anyscale will automatically shut down the idle cluster after a timeout (2 hours by default).

When you come back to work on the same project, you can just resume the workspace by starting it in the UI or running anyscale workspace start in the project directory.

Moving to production

Once you finish developing the application, you can create an Anyscale production service to deploy it. Anyscale production services run your Ray Serve workload with fault tolerance and automatic recovery. You can learn more here.

Set up your service

To use Anyscale services, you can use the CLI command either from your laptop or the development workspace.

First, save the YAML configuration file for the service. Notice that the entrypoint is similar to the command we used to test the model; the difference is the --non-blocking flag, which deploys the application safely into the background. You are encouraged to check this configuration file into your version control system as well.

name: "sentiment-service"
cluster_env: default_cluster_env_ml_2.2.0_py37:1
# working_dir is optional when deploy from a workspace.
working_dir: ""
entrypoint: "serve run app:model --non-blocking"
healthcheck_url: "/-/healthz"
access: "public"

For more information regarding the YAML schema, check out the reference documentation.

Next, you can deploy the service using the following CLI command. You should see relevant links in the output.

$ anyscale service deploy config.yaml
Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

(anyscale +1.6s) No cloud or compute config specified, using the default: cpt_bt3qwcnfqpa6p971xkvvkxms4i.
(anyscale +1.7s) No project specified. Continuing without a project.
(anyscale +2.4s) Maximum uptime is disabled for clusters launched by this service.
(anyscale +2.4s) Service service_zezrkzehhbrmem7mliq3xrnt has been deployed. Current state of service: PENDING.
(anyscale +2.4s) Query the status of the service with `anyscale service list --service-id service_zezrkzehhbrmem7mliq3xrnt`.
(anyscale +2.4s) View the service in the UI at

You can then visit the service link to:

  • Observe the current status of the service
  • View service logs and metrics
  • View the OpenAPI documentation page
  • Query the API

When you are using an Anyscale workspace, the working_dir will be automatically configured for you, so you don't need to push the code to a git repo.

- working_dir:

If you inspect the service configuration on the created service's page, you can see the runtime environment automatically populated for you.

runtime_env: {
  "working_dir": "file:///efs/jobs/599fc65fefb14e7180f44a1583751647/"
}

If you want to upload your local content instead of linking to a repo, you can use the working directory upload feature. You can do this from your workspace, your laptop, or (preferably) a CI/CD environment.

- working_dir:
+ working_dir: .
+ upload_path: "s3://your-bucket/path"
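Putting the pieces together, a complete config for deploying from a laptop or CI environment might look like the following sketch; the bucket path is a placeholder, and the app:model import path follows the serve run logs earlier in this tutorial.

```yaml
name: "sentiment-service"
cluster_env: default_cluster_env_ml_2.2.0_py37:1
working_dir: .
upload_path: "s3://your-bucket/path"  # placeholder: a bucket your cloud can access
entrypoint: "serve run app:model --non-blocking"
healthcheck_url: "/-/healthz"
access: "public"
```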

See more in the production service reference page.

Query your service

To query the service, visit the service page and click the "Query" button at the top right. Anyscale services are exposed over HTTP. You can use any HTTP client as long as you add the Bearer token in the request headers.
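For example, with the Python requests module; the service URL and token below are placeholders, so copy the real values from the service's "Query" dialog:

```python
import requests

# Placeholders: copy the real values from the service's "Query" dialog
SERVICE_URL = "https://sentiment-service-xyz.example.anyscale.com"
BEARER_TOKEN = "your-token"

# Build the request with the Bearer token in the Authorization header
req = requests.Request(
    "GET",
    f"{SERVICE_URL}/predict",
    params={"text": "Anyscale services are great!"},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
).prepare()

# requests.Session().send(req) would perform the actual call
print(req.url)
```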

Upgrade your service

To upgrade the service, edit the YAML with an updated working_dir and call anyscale service deploy again.


Please note that this upgrade will incur service downtime. Zero-downtime upgrades and traffic management are in active development. Please contact your support team to hear more!


To shut the service down, run:

$ anyscale service terminate --name sentiment-service

Loaded Anyscale authentication token from ~/.anyscale/credentials.json.

(anyscale +4.1s) Service service_zezrkzehhbrmem7mliq3xrnt has begun terminating...
(anyscale +4.1s) Current state of service: RUNNING. Goal state of service: TERMINATED
(anyscale +4.1s) Query the status of the service with `anyscale service list --service-id service_zezrkzehhbrmem7mliq3xrnt`.