
Observability and alerting

Endpoints Server overview page

The Endpoints Server overview page provides high-level details and configuration options for your deployed models. The Monitor tab offers status information on events and common metrics at a glance. For more advanced observability tooling, see the following section.

Endpoints Server Grafana dashboard

Grafana powers the Endpoints Server monitoring page and lets you visualize metrics, logs, and traces. To open the dashboard, click the Grafana icon at the top right of the Endpoints Server overview page.

The following metrics are the most useful for assessing the overall health of your deployed models:

Scaling Information
  • Active Nodes: Current nodes in the cluster.
  • Pending Nodes: Nodes in provisioning.
GPUs
  • Available GPUs: Total GPUs available for provisioning.
  • GPU Utilization: Utilization per GPU.
  • 1/5/15 Minute GPU Load Averages: Average utilization per GPU over specified intervals.
Replicas and Nodes
  • Replicas Per Application: Replicas running per application.
At a glance
  • Tokens Last Hour: Total input and generated tokens over the last hour.
  • Tokens Last 24 Hours: Total input and generated tokens over the last day.
  • Tokens Per Model Last 24 Hours: Tokens generated per model over the last day.
  • Distribution of Requests Per Model Last 24 Hours: Pie diagram of request distribution across all models over the last day.
  • Requests Last Hour: Total requests and responses over the last hour.
  • Ratio Input (Generated Tokens Last 24 Hours): Ratio of input tokens to generated tokens per model over the last day.
Endpoints
  • RPS: Requests per second directed to each API server.
  • Successful: Successful requests.
  • Failures: Failed requests.
Tokens and Requests Graph
  • Tokens Input Per Minute: Rate of input tokens per minute.
  • Tokens Input Per Request: Average input tokens per request.
  • Tokens Started Per Minute Per Model: Rate of requests started per minute per model.
  • Tokens Generated Per Minute: Rate of generated tokens per minute.
  • Tokens Generated Per Request: Average generated tokens per request.
  • Request Errors Per Minute Per Model: Rate of error requests per minute per model.
Latencies
  • First Token Stream Latency: First token latency by percentiles.
  • Response Stream Latency: Response stream latency by percentiles.
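
To make the per-minute, per-request, and percentile panels above more concrete, here is a minimal sketch of how such values could be derived from raw request samples. The `RequestRecord` type and the function names are hypothetical and exist only for illustration; they are not part of Anyscale or Grafana, and the dashboard may compute its panels differently.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    """Hypothetical per-request sample; not an Anyscale data structure."""
    input_tokens: int
    generated_tokens: int
    first_token_latency_s: float


def tokens_per_minute(records, window_minutes):
    """Total generated tokens divided by the window length in minutes."""
    return sum(r.generated_tokens for r in records) / window_minutes


def tokens_per_request(records):
    """Average number of generated tokens per request."""
    return sum(r.generated_tokens for r in records) / len(records)


def latency_percentile(records, pct):
    """Return the pct-th percentile (1 to 99) of first-token latency."""
    latencies = [r.first_token_latency_s for r in records]
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th one.
    return quantiles(latencies, n=100, method="inclusive")[pct - 1]
```

A per-request value is an average over requests, while a per-minute value is a rate over time, so the two can move in opposite directions: fewer but longer responses raise tokens per request even as tokens per minute falls.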

Alerts

Anyscale preconfigures alerts through Grafana to send notifications when incidents occur with your Endpoints Server. View alerts on the alerting page. Find the corresponding metrics in the Endpoints Server metrics dashboard under the Endpoint Alerting header.

Preconfigured alerts

Anyscale automatically configures the following alerts:

| Alert | Description |
| --- | --- |
| Autoscaler Pending Nodes | Alert when a node remains in the pending state for more than 15 minutes. |
| CPU Usage | Alert when CPU usage on any node exceeds 80% for more than 5 minutes. |
| P50 Request Latency | Alert when the median latency for requests exceeds 60 seconds for more than 5 minutes by route. |
| Queued Requests | Alert when queued requests exceed 50 per minute within a 5-minute window. |
| Serve 4XX Responses | Alert when the number of 4XX HTTP responses, or "client errors," reaches 5 within a 5-minute window. |
| Serve 5XX Responses | Alert when the number of 5XX HTTP responses, or "server errors," reaches 2 within a 5-minute window. |
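
The preconfigured alerts use two kinds of condition: a value that stays above a threshold for a minimum duration (CPU usage, latency, pending nodes) and an event count that reaches a limit within a time window (4XX and 5XX responses). The sketch below illustrates both patterns. The function names and sample format are assumptions made for illustration; this is not how Grafana evaluates rules internally.

```python
def sustained_above(samples, threshold, min_duration_s):
    """Return True if the value stayed above threshold for longer than min_duration_s.

    samples: list of (timestamp_s, value) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the timer
    return False


def count_reached(event_times, now_s, window_s, limit):
    """Return True if at least `limit` events fall within the last window_s seconds."""
    recent = [t for t in event_times if now_s - window_s < t <= now_s]
    return len(recent) >= limit
```

For example, the CPU Usage rule corresponds to `sustained_above(cpu_samples, 80, 5 * 60)`, and the Serve 5XX Responses rule to `count_reached(error_times, now_s, 5 * 60, 2)`.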

Set up alerting channels

To receive alerts, configure a notification channel and select the alerts you want to be notified about.

Note

Anyscale doesn't support email notifications through Grafana. To request this feature, contact the support team at endpoints-help@anyscale.com.

To set up notifications, follow these steps:

  1. Obtain your organization's PagerDuty integration key.
  2. Click Manage Alerts on the Endpoints Server overview page.
  3. Select the Notification Channels tab and configure a PagerDuty notification channel using the integration key.
  4. Navigate back to Alert Rules and enable the desired alerts by clicking Edit Alert and adding the notification channel.
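
Before wiring the integration key into Grafana, it can help to confirm that the key works on its own. The sketch below is not part of the Anyscale setup; it assumes the key is a PagerDuty Events API v2 integration key and builds a test event for the public `https://events.pagerduty.com/v2/enqueue` endpoint. The helper names `build_test_event` and `send_test_event` and the default `source` value are hypothetical.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(integration_key, summary, source="endpoints-server"):
    """Build an Events API v2 trigger event with the lowest severity."""
    return {
        "routing_key": integration_key,  # the integration key from step 1
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,  # assumed label; use any name that identifies the sender
            "severity": "info",
        },
    }


def send_test_event(event, timeout_s=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Calling `send_test_event(build_test_event(key, "Test alert"))` opens a real alert in PagerDuty, so resolve it afterward. Once the channel is configured, Grafana sends events on your behalf and this script is no longer needed.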

Manually configure panels and alerts

You can customize panels and alerts in the Endpoints Server Grafana dashboard to suit your use cases. For panels, open the Grafana dashboard and select the edit button. For alerts, navigate to the Alert Rules page and update the query and alert rules. These settings persist for the lifetime of the Endpoints Server.