Monitoring a Service
This version of the Anyscale docs is deprecated. Go to the latest version for up-to-date information.
Anyscale Services requires Ray 2.3+. Monitoring dashboards specific to Serve require Ray 2.4+.
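The version requirements above can be checked programmatically. The following is a minimal sketch, not an Anyscale API: the function names and the dictionary keys are hypothetical, and it assumes the version string has the usual `major.minor.patch` shape (in a real environment it could come from `ray.__version__`).

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.4.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def monitoring_support(ray_version: str) -> dict[str, bool]:
    """Report which features are available for a given Ray version.

    The thresholds come from the requirements stated in this document:
    Ray 2.3+ for Anyscale Services, Ray 2.4+ for Serve-specific dashboards.
    """
    parsed = parse_major_minor(ray_version)
    return {
        "anyscale_services": parsed >= (2, 3),
        "serve_dashboards": parsed >= (2, 4),
    }


if __name__ == "__main__":
    print(monitoring_support("2.3.1"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.10" as older than "2.4".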
Anyscale Services provides several tools to monitor your Service: the Service detail page, Service-level metrics, the Ray Dashboard, Deployment-level metrics, and logs.
This document describes each tool and suggests when to use it.
Service detail page
The Service detail page is the primary source of information about your Service and serves as an entry point into other observability and monitoring tools. It contains a summary of the Service, information about your Service's configuration, some high-level details of your Service, and links to various other tools.
A list of Serve deployments appears at the bottom of the Service detail page.
On a separate tab, there is the Service events log. This is a list of events related to your Service, including events about your Service lifecycle, rollouts, and errors.
Service-level metrics
To access this dashboard, go to the Service detail page, open the "Dashboard" menu, and click the "Metrics" button.
Service metrics are the primary source of metrics for understanding the overall health of your Service. They primarily track application-level metrics such as the number of requests, latency, and error rate. They also show an overview of hardware metrics such as CPU and network utilization, memory and disk usage, and node count.
This page is powered by Grafana, which is a powerful tool that lets you visualize and explore time-series data. Dropdowns at the top of the dashboard let you filter to specific routes.
The top row of the dashboard contains the "Rollout metrics." High-level metrics are split out by version so you can compare the performance of each version. If performance drops, you can decide to roll back a rollout.
The rest of the dashboard contains the "Service metrics," which show the data across all versions. You can track metrics over time, regardless of the rollouts that have occurred during that time.
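The comparison between versions can be sketched in code. This is only an illustration of the reasoning, not something Anyscale runs for you: the `VersionStats` class, the `should_roll_back` function, and both thresholds are hypothetical, and the numbers would be read off the rollout metrics by hand.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Hypothetical per-version summary taken from the rollout metrics."""

    name: str
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero for a version that has no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    old: VersionStats,
    new: VersionStats,
    max_error_increase: float = 0.01,  # assumed tolerance: +1 percentage point
    max_latency_ratio: float = 1.5,    # assumed tolerance: 50% slower at p95
) -> bool:
    """Return True if the new version looks clearly worse than the old one."""
    if new.requests == 0:
        return False  # nothing to compare yet
    error_regressed = (new.error_rate - old.error_rate) > max_error_increase
    latency_regressed = (
        old.p95_latency_ms > 0
        and new.p95_latency_ms / old.p95_latency_ms > max_latency_ratio
    )
    return error_regressed or latency_regressed


if __name__ == "__main__":
    v1 = VersionStats("v1", requests=1000, errors=5, p95_latency_ms=120.0)
    v2 = VersionStats("v2", requests=1000, errors=40, p95_latency_ms=130.0)
    print(should_roll_back(v1, v2))
```

The thresholds are placeholders; suitable values depend on your Service's traffic and latency targets.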
Ray Dashboard
The Ray Dashboard is scoped to a single Ray cluster. Each version has one Ray cluster. To access this dashboard, go to the Service detail page. Open the "Dashboard" menu and click the "Ray Dashboard" button for the version you are interested in viewing.
The Ray Dashboard Serve page is a tool to view the status of your Serve applications and Serve system components. The Serve page helps you understand the health of your application, and find additional detail to debug applications.
The top section shows the status of Serve system components, such as the Serve Controller and the HTTP Proxies for each node. You can see whether the components are healthy, and you can click into a component's details to see its logs. The Serve Controller logs can be a useful source of information when debugging issues with the deployment of your Serve Deployments and Replicas. The HTTP Proxy logs can help you understand whether your Service is receiving requests as expected.
The bottom section of the Serve page shows details of your Serve applications. You can see a high-level status of each application, or you can click into an application to see details.
The Serve Application detail page shows the high-level status of your Serve application. It shows the number of deployments and replicas for the application, along with their statuses. Click the "Metrics" button for each deployment to visit its Deployment dashboard. Click into a replica to see more details about that replica, including its logs.
The Serve Replica detail page shows detailed information about the Serve Replica, including the logs, metrics, and a task history table. You can use the task history table to see all the Ray tasks called on this replica and the logs associated with those Ray tasks.
Deployment-level metrics
The Deployment dashboard is scoped to a single Ray Serve deployment. To access this dashboard, click the "Metrics" button for a deployment in the Service detail page or in the Ray Dashboard Serve Application detail page.
The Deployment dashboard shows detailed metrics for a single deployment, including request QPS, errors, latency, and other metrics. The page is powered by Grafana, which is a powerful tool that lets you visualize and explore time-series data.
Dropdowns at the top of the dashboard let you filter to specific replicas or specific routes.
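Filtering by replica or route amounts to narrowing a time-series query by label. The sketch below builds a PromQL-style query string to show the idea. It rests on several assumptions: that the dashboard is backed by a Prometheus-compatible data source, that the metric is named `ray_serve_deployment_request_counter`, and that the labels are called `deployment`, `replica`, and `route`. Check these names against the Serve Metrics documentation for your Ray version. The helper functions are hypothetical.

```python
def label_selector(**labels: str) -> str:
    """Build a PromQL label selector such as {deployment="A",route="/x"}."""
    parts = [
        f'{key}="{value}"'
        for key, value in sorted(labels.items())
        if value is not None
    ]
    return "{" + ",".join(parts) + "}"


def qps_query(metric: str, window: str = "1m", **labels: str) -> str:
    """Return a query for the per-second request rate of a counter metric."""
    return f"sum(rate({metric}{label_selector(**labels)}[{window}]))"


if __name__ == "__main__":
    # Metric name, label names, and label values are illustrative assumptions.
    print(
        qps_query(
            "ray_serve_deployment_request_counter",
            deployment="Translator",
            route="/translate",
        )
    )
```

Leaving out a label, for example `replica`, aggregates across all of its values, which mirrors leaving a dropdown set to "All".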
Grafana dashboard data is retained for 90 days from cluster termination.
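As a worked example of the retention rule, the sketch below computes when dashboard data for a terminated cluster would expire. The function names are hypothetical, and whether the final day is inclusive is an assumption this document does not settle.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention period stated in this document.
RETENTION = timedelta(days=90)


def dashboard_data_expiry(terminated_at: datetime) -> datetime:
    """Return the moment 90 days after the cluster was terminated."""
    return terminated_at + RETENTION


def is_data_available(terminated_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True while the retention window is still open."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < dashboard_data_expiry(terminated_at)


if __name__ == "__main__":
    terminated = datetime(2023, 1, 1, tzinfo=timezone.utc)
    print(dashboard_data_expiry(terminated).date())
```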
Logs
For a running Service
For running Services, you can view the logs using the Ray Dashboard. The Serve page contains logs for the HTTP Proxy, the Serve Controller, and each Replica of your Service. Click into a component's detail page to view its logs.
For an inactive Service
For a terminated or unhealthy Service, you can download the log files from the cluster or clusters that served the Service.
Known Limitations
- Only application and Ray system logs are persisted and available for download. Downloading logs from ray_results is not supported yet.
Download Logs
Use anyscale logs to view and download logs. View the "Ray Logs" tab in the cluster page on the Console UI for more instructions.
# View logs for a particular cluster. To find the cluster ID, go to the Service page
# and find the cluster the Service ran on. It should look something like ses_8kVvPt6pNkR7xJlEE2zfQQXW
anyscale logs cluster --id <cluster-id> <glob-filter | filename>
# Download all logs for a particular cluster
anyscale logs cluster --id <cluster-id> --download
# More help
anyscale logs cluster --help
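Because a Service may have been served by more than one cluster, it can be convenient to script the download. The following is a minimal sketch that wraps the `anyscale logs cluster --id <cluster-id> --download` command shown above. The function names are hypothetical, the cluster IDs are placeholders, and it assumes the `anyscale` CLI is installed and authenticated.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def download_command(cluster_id: str) -> List[str]:
    """Build the CLI invocation for downloading all logs of one cluster."""
    return ["anyscale", "logs", "cluster", "--id", cluster_id, "--download"]


def download_logs(
    cluster_ids: Iterable[str],
    run: Callable = subprocess.run,  # injectable so the function can be tested
) -> Dict[str, int]:
    """Run the download for each cluster and return exit codes by cluster ID."""
    results = {}
    for cluster_id in cluster_ids:
        completed = run(download_command(cluster_id), check=False)
        results[cluster_id] = completed.returncode
    return results
```

Calling `download_logs(["<cluster-id-1>", "<cluster-id-2>"])` would run the command once per cluster; a non-zero value in the result marks a cluster whose download failed.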
Adding custom logs to your application
See the Ray Serve monitoring docs for how to add custom logs to your application.
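As a rough illustration, the Ray Serve documentation describes using Python's standard `logging` module with a logger named "ray.serve". The sketch below follows that pattern. The `Greeter` class is hypothetical, and it is left as a plain class so that the example runs without Ray installed; in a real application it would be decorated with `@serve.deployment`.

```python
import logging

# Ray Serve's documented logger name; messages logged here are expected to
# appear alongside the replica's other logs.
logger = logging.getLogger("ray.serve")


class Greeter:
    """Hypothetical request handler that emits custom log lines."""

    def __call__(self, name: str) -> str:
        logger.info("Handling request for name=%s", name)
        if not name:
            logger.warning("Received an empty name; using a default")
            name = "world"
        return f"Hello, {name}!"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(Greeter()("Ada"))
```

Passing values as logging arguments (`%s`) rather than pre-formatting the string defers formatting until the message is actually emitted.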
Alerts
Alerts are currently in development. Contact our support team if you would like this feature.
Summary
When to use each tool depends on your task:
- The Service detail page is the entry point into information about your Service and provides links to other tools.
- Use the Service dashboard to monitor the overall health of your Service and its deployments.
- Use the Ray Dashboard to monitor the Serve application in more detail, or to debug issues with your application.
- Use the Deployment dashboard to monitor the performance of a specific deployment or to debug a specific deployment. Often, you can enter this dashboard from the Ray Dashboard.
More info
- To learn more about monitoring Serve applications, see the Serve monitoring documentation.
- To learn more about Grafana and how to use it, see the official Grafana documentation.
- To learn more about the metrics that Ray Serve emits, see the Serve Metrics documentation.
- To learn more about the Ray Dashboard, see the Ray Dashboard documentation.
- Custom metrics can be added to your applications. To learn more, see the Ray Custom Metrics documentation. To visualize these metrics, you can add them to a Grafana dashboard.