Default Grafana dashboards
Anyscale and Ray provide many Grafana dashboards out of the box. You can use these dashboards as starting points for creating your own custom dashboards. Before editing a dashboard, duplicate it and edit the copy, because Anyscale may update and replace the default dashboards over time, overwriting your changes. Anyscale may also share these dashboards across multiple Ray apps and clusters, so editing a copy also avoids accidentally affecting other apps.
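The Grafana UI is usually the simplest way to duplicate a dashboard. If you work with exported dashboard JSON instead, the same idea can be sketched in a few lines of Python. This is a minimal sketch, not an Anyscale tool: the `duplicate_dashboard` helper and the sample `default` dashboard are hypothetical, and it assumes the standard Grafana dashboard JSON model, where a copy needs a null `id` and a fresh `uid` so that importing it creates a new dashboard rather than overwriting the original.

```python
import copy
import uuid


def duplicate_dashboard(dashboard: dict, title_suffix: str = " (custom)") -> dict:
    """Return an independent copy of a Grafana dashboard JSON model.

    Hypothetical helper for illustration; not part of Ray or Anyscale.
    """
    clone = copy.deepcopy(dashboard)
    # A null id asks Grafana to create a new dashboard on import.
    clone["id"] = None
    # A fresh uid keeps the copy from colliding with the default dashboard.
    clone["uid"] = uuid.uuid4().hex[:12]
    clone["title"] = dashboard.get("title", "Untitled") + title_suffix
    # Drop the version counter so the copy starts its own history.
    clone.pop("version", None)
    return clone


# Made-up stand-in for an exported default dashboard.
default = {
    "id": 7,
    "uid": "rayDefault",
    "title": "Ray Core",
    "version": 3,
    "panels": [{"title": "CPU"}],
}
custom = duplicate_dashboard(default)
```

Because the copy is deep, later edits to `custom["panels"]` leave the `default` dictionary unchanged.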
These dashboards visualize metrics exported by Ray Core and the Ray libraries. A common modification is adding graphs that visualize custom application metrics defined by your Ray app.
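As a rough illustration of adding a graph for a custom metric, the sketch below appends a time series panel to a dashboard JSON model. The helper names are hypothetical, the panel fields follow the general Grafana dashboard JSON model, and the metric name `ray_my_app_requests_total` is a made-up example. Check the metric names your cluster actually exports before reusing the query.

```python
def make_timeseries_panel(title: str, expr: str, panel_id: int) -> dict:
    """Build a minimal Grafana time series panel with one Prometheus query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # Size and position on Grafana's 24-column grid.
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"expr": expr, "refId": "A"}],
    }


def add_panel(dashboard: dict, title: str, expr: str) -> dict:
    """Append a panel to the dashboard, using the next free panel id."""
    panels = dashboard.setdefault("panels", [])
    next_id = max((p.get("id") or 0 for p in panels), default=0) + 1
    panels.append(make_timeseries_panel(title, expr, next_id))
    return dashboard


# Work on a duplicated dashboard, never on the default one.
dashboard = {"title": "Ray Core (custom)", "panels": []}
add_panel(
    dashboard,
    title="App requests per second",
    # Hypothetical custom metric name, for illustration only.
    expr="sum(rate(ray_my_app_requests_total[5m]))",
)
```

Keeping the query in a plain string makes it easy to test in the Grafana Explore view before committing it to a panel.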
Ray Core dashboard
The Ray Core dashboard provides visualizations of system-level Ray metrics and hardware metrics. This dashboard is useful for monitoring the health of the Ray cluster.
System-level graphs include information about Ray tasks, Ray actors, nodes, the autoscaler, and more. Hardware metrics include CPU, GPU, memory, disk, and network utilization. See the Ray system metrics documentation for more information.
Ray Data dashboard
The Ray Data dashboard provides visualizations of Ray Data metrics. This dashboard is useful for monitoring the health and performance of your Ray Data workloads. Dataset-level graphs include information about dataset outputs (rows, blocks, bytes, tasks), dataset iteration metrics, as well as internal operator metrics (internal queues, object store usage, and more). See the Ray Data monitoring docs for more information.
Ray Serve
Ray Serve dashboard
The Ray Serve dashboard provides visualizations of Ray Serve metrics. This dashboard is useful for monitoring the health of your Ray Serve apps. It includes service metrics such as request latency, request throughput, and service health. It also includes service rollout metrics to track and compare the different service versions. See the Ray Serve monitoring documentation for more information.
Anyscale aggregates the service metrics at the app and route level. To look at metrics grouped by individual replicas, use the Ray Serve deployment dashboard.
Ray Serve deployment dashboard
The Ray Serve deployment dashboard provides visualizations of Ray Serve deployment metrics. This dashboard is useful for monitoring the health of your Ray Serve deployments. It includes service metrics such as request latency, request throughput, and deployment health. See the Ray Serve monitoring documentation for more information.
Anyscale aggregates the service metrics at the deployment level and groups them by replica ID, so you can compare the relative health of different replicas.
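To make the difference between deployment-level and replica-level views concrete, the sketch below builds the kind of PromQL expression a custom panel might use for each. It is only an illustration: `rate_by` is a hypothetical helper, and the metric name and the `deployment` and `replica` label names are assumptions that you should verify against the metrics your cluster exports.

```python
from typing import Sequence


def rate_by(metric: str, labels: Sequence[str], window: str = "5m") -> str:
    """Build a PromQL query: per-second rate of a counter, summed per label set."""
    grouping = ", ".join(labels)
    return f"sum by ({grouping}) (rate({metric}[{window}]))"


# Assumed metric name, for illustration only.
METRIC = "ray_serve_deployment_request_counter_total"

# One series per deployment.
per_deployment = rate_by(METRIC, ["deployment"])

# One series per replica within each deployment.
per_replica = rate_by(METRIC, ["deployment", "replica"])
```

Adding `replica` to the grouping labels splits each deployment's series into one series per replica, which is what makes a side-by-side comparison of replicas possible.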