
Monitoring a Service

note

Use of Anyscale Services requires Ray 2.3+. Monitoring dashboards specific to Serve require Ray 2.4+.

Anyscale Services provides several tools to monitor your Service:

  1. Service detail page
  2. Service-level metrics
  3. Ray Dashboard
  4. Deployment-level metrics
  5. Logs
  6. Alerts

This document describes each tool and suggests when to use it.

Service detail page

The Service detail page is the primary source of information about your Service and serves as an entry point into other observability and monitoring tools. It contains a summary of the Service, information about its configuration, high-level details, and links to the other tools.

A list of Serve deployments is shown at the bottom of the Service detail page.

A separate tab contains the Service events log. This is a list of events related to your Service, including events about its lifecycle, rollouts, and errors.

Service-level metrics

To access this dashboard, go to the Service detail page, open the "Dashboard" menu, and click the "Metrics" button.

Service metrics are the primary source of metrics for understanding the overall health of your Service. They primarily track application-level metrics such as the number of requests, latency, and error rate. They also show an overview of hardware metrics such as CPU and network utilization, memory and disk usage, and node count.

This page is powered by Grafana, a tool that lets you visualize and explore time-series data. Dropdowns at the top of the dashboard let you filter to specific routes.

The top row of the dashboard contains the "Rollout metrics." High-level metrics are split out by Service Version so you can compare the performance of each Service Version. If performance drops, you can decide to roll back a rollout.

The rest of the dashboard contains the "Service metrics," which shows the data across all Service Versions. You can track metrics over time, regardless of the rollouts that have occurred during that time.

Ray Dashboard

The Ray Dashboard is scoped to a single Ray cluster. Each Service Version has one Ray cluster. To access this dashboard, go to the Service detail page. Open the "Dashboard" menu and click the "Ray Dashboard" button for the Service Version you are interested in viewing.

The Ray Dashboard Serve page is a tool to view the status of your Serve applications and Serve system components. The Serve page helps you understand the health of your application, and find additional detail to debug applications.

The top section shows the status of Serve system components, such as the Serve Controller and the HTTP Proxy on each node, so you can see whether each component is healthy. Click into a component's details to see its logs. The Serve Controller logs can be a useful source of information when debugging issues with the deployment of your Serve Deployments and Replicas. The HTTP Proxy logs can help you understand whether your Service is receiving requests as expected.

The bottom section of the Serve page shows details of your Serve applications. You can see the high-level status of each application, or click into an application to see details.

The Serve Application detail page shows the high-level status of your Serve application, including the number of deployments and replicas and their statuses. Click the "Metrics" button for a deployment to visit its Deployment dashboard. Click into a replica to see more details about that replica, including its logs.

The Serve Replica detail page shows detailed information about the Serve Replica, including the logs, metrics, and a task history table. You can use the task history table to see all the Ray tasks called on this replica and the logs associated with those Ray tasks.

Deployment-level metrics

The Deployment dashboard is scoped to a single Ray Serve deployment. To access this dashboard, click the "Metrics" button for a deployment on the Service detail page or on the Ray Dashboard Serve Application detail page.

The Deployment dashboard shows detailed metrics for a single deployment, including request QPS, errors, latency, and other metrics. The page is powered by Grafana, a tool that lets you visualize and explore time-series data.

Dropdowns at the top of the dashboard let you filter to specific replicas or specific routes.

Grafana dashboard data is retained for 90 days from cluster termination.

Logs

For a running Service

For running Services, you can view the logs using the Ray Dashboard. The Serve page contains logs for the HTTP Proxy, the Serve Controller, and each of the Replicas of your Service. To view the logs for a component, click into that component's detail page.

For an inactive Service

For a terminated or unhealthy Service, you can download the log files from each of the clusters that served the Service.

info

Known Limitations

  • Only application and Ray system logs are persisted and available for download. Downloading logs from ray_results is not supported yet.

Download Logs

Use anyscale logs to view and download logs. See the "Ray Logs" tab on the cluster page in the Console UI for more instructions.

# View logs for a particular cluster.
# To find the cluster ID, go to the Service page and find the cluster the Service ran on.
# It should look something like ses_8kVvPt6pNkR7xJlEE2zfQQXW
anyscale logs cluster --id <cluster-id> <glob-filter | filename>

# Download all logs for a particular cluster
anyscale logs cluster --id <cluster-id> --download

# More help
anyscale logs cluster --help
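
Because a Service can be served by more than one cluster, downloading its logs may mean repeating the download command once per cluster. The following is a minimal sketch of that loop, assuming the anyscale CLI is installed and authenticated. The helper names build_download_command and download_logs and the placeholder cluster IDs are hypothetical; only the anyscale logs cluster --id <cluster-id> --download invocation comes from the commands above.

```python
import subprocess
from typing import Callable, Iterable, List


def build_download_command(cluster_id: str) -> List[str]:
    """Return the argv that downloads all logs for one cluster."""
    if not cluster_id:
        raise ValueError("cluster_id must be a non-empty string")
    return ["anyscale", "logs", "cluster", "--id", cluster_id, "--download"]


def download_logs(
    cluster_ids: Iterable[str],
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> List[List[str]]:
    """Run the download command once per cluster and return the commands used.

    `run` is injectable (hypothetical design choice) so the loop can be
    exercised as a dry run without the CLI installed.
    """
    commands = []
    for cluster_id in cluster_ids:
        cmd = build_download_command(cluster_id)
        run(cmd)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    # Replace the placeholders with the cluster IDs shown on the Service page.
    download_logs(
        ["<cluster-id-1>", "<cluster-id-2>"],
        run=lambda cmd: print(" ".join(cmd)),
    )
```

Dropping the run argument makes download_logs execute the real CLI through subprocess.run, which raises an exception if any download exits with a non-zero status.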

Adding custom logs to your application

See the Ray Serve monitoring docs for how to add custom logs to your application.
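
As a rough illustration of what a custom log line can look like, the Ray Serve documentation describes a standard Python logger named "ray.serve". The sketch below uses only the standard library so that it runs without a cluster. The Greeter class and its message are hypothetical; in a real application the class would be a Serve deployment (for example, decorated with @serve.deployment), and Serve would configure the handlers for this logger.

```python
import logging

# Logger name described in the Ray Serve docs for deployment logs.
logger = logging.getLogger("ray.serve")


class Greeter:
    """Hypothetical handler; in Serve this class would be a deployment."""

    def __call__(self, name: str) -> str:
        # Custom log line emitted each time the handler is called.
        logger.info("Handling greeting request for %s", name)
        return f"Hello, {name}!"


if __name__ == "__main__":
    # Outside Serve nothing configures the logger, so set up basic output here.
    logging.basicConfig(level=logging.INFO)
    print(Greeter()("world"))
```

When the class runs as a Serve replica, log lines written this way are expected to appear alongside the other logs for that replica, such as those on the Serve Replica detail page.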

Alerts

note

This feature is currently in development. Contact our support team if you would like to use it.

Anyscale sends you email notifications when the following events happen on a Service that you created:

  • Service is being restarted after failure
  • Service successfully starts up after failure

These emails are sent from Anyscale Alerts (do not reply) <alerts@console.anyscale.com>.

By default, you are subscribed to all notification emails. To manage your subscription, click the Unsubscribe link in the footer of any of these emails. The link takes you to a subscription preferences page, where you can selectively subscribe to or unsubscribe from individual notification topics. After checking or unchecking topics, click the Update button at the bottom of the page. Once updated, you stop receiving emails for unchecked topics. To subscribe again, check the topic on the subscription preferences page and click Update.

Summary

When to use each tool depends on your task:

  • The Service detail page is the entry point into information about your Service and provides links to other tools.
  • Use the Service-level metrics dashboard to monitor the overall health of your Service and its deployments.
  • Use the Ray Dashboard to monitor the Serve application in more detail, or to debug issues with your application.
  • Use the Deployment dashboard to monitor the performance of a specific deployment or to debug a specific deployment. You can often open this dashboard from the Ray Dashboard.

More info