---
title: "Troubleshoot Anyscale service failures and HTTP errors"
description: "Troubleshoot common Anyscale service issues including failure states, stuck services, and HTTP 5xx errors."
---

# Troubleshoot Anyscale service failures and HTTP errors

This article covers common failure states and HTTP errors you might encounter with Anyscale services, including how to identify them and what actions to take.

## Service failure states

Anyscale services can enter several failure states. The service state appears in the Anyscale console and in CLI output from `anyscale service status`.

### SYSTEM\_FAILURE

A `SYSTEM_FAILURE` state indicates an unexpected error in the Anyscale control plane. This state typically isn't caused by your configuration or code.

Common symptoms include the following:

-   Multiple services enter `SYSTEM_FAILURE` at the same time.
-   The service event log shows "An unexpected system failure has occurred and manual intervention may be required."
-   The service event log shows unhealthy messages such as "deploying the most recent Ray Serve config failed" or "the Ray Serve REST API is unreachable or returned an unexpected error".
-   The head node ran out of disk space, preventing log and metric export.

A full head node disk can push a service into `SYSTEM_FAILURE` because the cluster can no longer export logs and metrics. If cluster metrics show the head node disk at capacity, redeploy the service with a larger instance type or reduce disk usage in your application.
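To confirm how full a disk is before redeploying, a quick check with Python's standard library can help. The following is a minimal sketch, not an Anyscale tool: the `/` path and the 90% threshold are assumptions, and you would need to run it on the head node itself, for example from a terminal attached to the cluster.

```python
import shutil


def disk_usage_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """Return True when usage is at or above `threshold` (an assumed, arbitrary cutoff)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    print(f"{disk_usage_fraction('/'):.1%} of / is in use")
```

If the reported usage is close to 100%, the disk is a plausible cause of the failure.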

For other `SYSTEM_FAILURE` cases, contact [Anyscale support](mailto:support@anyscale.com) with the service ID. These failures typically require investigation by the Anyscale team.

### USER\_ERROR\_FAILURE

A `USER_ERROR_FAILURE` state indicates a permissions or configuration error that prevents Anyscale from managing resources in your cloud account.

To troubleshoot a `USER_ERROR_FAILURE`, do the following:

1.  Check the service event log in the Anyscale console for specific error messages related to permissions.
2.  Run `anyscale cloud verify` to check your cloud configuration. See [`anyscale cloud verify`](/reference/cli/cloud.md#anyscale-cloud-verify).
3.  Fix any permission or configuration issues identified by the verification.
4.  Redeploy the service.

If the issue persists after verifying your cloud configuration, contact [Anyscale support](mailto:support@anyscale.com).

### Stuck in STARTING

Services remain in the `STARTING` state until the primary version cluster starts and cloud networking resources are provisioned. These include TLS certificates, load balancers, and target groups.

The first deployment in a new cloud can take approximately 10 minutes while cloud resources provision. If a service remains in `STARTING` longer than expected, a permissions or configuration error might be blocking provisioning.

To troubleshoot, do the following:

1.  Check the service event log in the Anyscale console for error messages.
2.  Run `anyscale cloud verify` to validate your cloud configuration. See [`anyscale cloud verify`](/reference/cli/cloud.md#anyscale-cloud-verify).
3.  If no errors appear, contact [Anyscale support](mailto:support@anyscale.com) with the service ID.

### Stuck in TERMINATING

A service can become stuck in the `TERMINATING` state, which prevents you from deleting it. This can happen when underlying cluster resources are removed before the service properly terminates.

Symptoms include the following:

-   The service shows as "terminating" in the Anyscale console but never completes.
-   `anyscale service terminate` has no effect.
-   `anyscale service delete` fails with the error:

    ```text
    Service must be in a TERMINATED state before it can be deleted.
    ```

-   `anyscale cloud delete` fails because the stuck service blocks cloud deletion.

Contact [Anyscale support](mailto:support@anyscale.com) with the service ID. Resolving this state requires internal intervention.
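The guidance for the states above can be condensed into a small lookup, which might be useful in a runbook or alerting script. This is an illustrative sketch only: the `NEXT_STEP` table and `next_step` function are hypothetical names, and the state string is assumed to be something you have already read from the console or from `anyscale service status`.

```python
# Hypothetical summary of this article's guidance, keyed by service state.
NEXT_STEP = {
    "SYSTEM_FAILURE": (
        "Check head node disk usage; otherwise contact Anyscale support "
        "with the service ID."
    ),
    "USER_ERROR_FAILURE": (
        "Check the service event log, run `anyscale cloud verify`, fix the "
        "reported issues, then redeploy."
    ),
    "STARTING": (
        "If stuck, check the service event log and run `anyscale cloud verify`; "
        "contact Anyscale support if no errors appear."
    ),
    "TERMINATING": "If stuck, contact Anyscale support with the service ID.",
}


def next_step(state: str) -> str:
    """Return the suggested action for a service state, ignoring case."""
    default = "Not covered in this article; check the service event log."
    return NEXT_STEP.get(state.strip().upper(), default)
```

For example, `next_step("terminating")` returns the advice to contact support.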

## HTTP 5xx errors

HTTP `502`, `503`, and `504` errors from an Anyscale service typically originate from the cloud load balancer or the serving infrastructure, not from your application code. For these codes, the specific status helps narrow the cause. HTTP `500` responses, by contrast, usually indicate unhandled exceptions in your application code. Check the Ray Serve logs for stack traces.

For background on load balancer timeout defaults and configuring client-side retries, see [Manage timeouts and retries for Anyscale services](/services/retries-timeouts.md).
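When you have a batch of response codes, for example from client-side logs, grouping them by likely origin shows where to look first. The following sketch assumes you have already collected the status codes as integers; `likely_origin` and `summarize` are hypothetical helper names, and the grouping only reflects the rule of thumb described above.

```python
from collections import Counter
from typing import Iterable


def likely_origin(status: int) -> str:
    """Classify a status code using this article's rule of thumb."""
    if status == 500:
        return "application"  # usually an unhandled exception in your code
    if status in (502, 503, 504):
        return "infrastructure"  # load balancer or serving infrastructure
    return "other"


def summarize(statuses: Iterable[int]) -> Counter:
    """Count responses by likely origin."""
    return Counter(likely_origin(status) for status in statuses)
```

A result dominated by `"application"` points toward the Ray Serve logs, while a result dominated by `"infrastructure"` points toward the sections that follow.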

### HTTP 502 Bad Gateway

A `502` response means the load balancer received an invalid response from the service.

**Cause:** The load balancer's idle timeout is longer than the keep-alive timeout in Ray Serve's HTTP server (uvicorn). When the server closes an idle connection that the load balancer still considers open, the load balancer's next request on that connection fails and the load balancer returns a `502`.

**Solution:** Set `keep_alive_timeout_s` in your Ray Serve HTTP options to a value longer than the load balancer's idle timeout:

| Cloud | Load balancer idle timeout | Recommended `keep_alive_timeout_s` |
| --- | --- | --- |
| AWS | 300 seconds (5 minutes) | Greater than 300 seconds |
| Google Cloud | 600 seconds (10 minutes) | Greater than 600 seconds |

Configure `keep_alive_timeout_s` in your service config. For example, a value of 650 seconds exceeds the default idle timeout on both clouds:

```yaml
http_options:
  keep_alive_timeout_s: 650
```
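A small check can catch a keep-alive value that is too low before you deploy. This sketch hard-codes the idle timeouts from the table above; the `aws` and `gcp` keys and the function name are assumptions for illustration, and the values would need updating if your load balancer uses a non-default timeout.

```python
# Default load balancer idle timeouts from the table above, in seconds.
LB_IDLE_TIMEOUT_S = {"aws": 300, "gcp": 600}


def keep_alive_is_safe(cloud: str, keep_alive_timeout_s: int) -> bool:
    """Return True if the keep-alive timeout is strictly longer than the idle timeout."""
    if cloud not in LB_IDLE_TIMEOUT_S:
        raise ValueError(f"unknown cloud: {cloud!r}")
    return keep_alive_timeout_s > LB_IDLE_TIMEOUT_S[cloud]
```

Note that a value equal to the idle timeout doesn't pass, because the recommendation is for the keep-alive timeout to be greater than the load balancer's idle timeout.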

### HTTP 503 Service Unavailable

A `503` response means no healthy targets are available to handle the request.

Any of the following can cause a `503` response:

-   The load balancer target group is empty because no cluster instances are running.
-   None of the targets in the load balancer target group are healthy.
-   On Google Cloud, `503` errors can occur for up to five minutes after a service first starts while health checks initialize.

To troubleshoot, do the following:

1.  Check that the service has running instances in the Anyscale console.
2.  Check the service event log for version health status. If a version is marked `UNHEALTHY`, review the Ray Serve logs for application errors.
3.  If instances are running and healthy but `503` errors persist, contact [Anyscale support](mailto:support@anyscale.com).

### HTTP 504 Gateway Timeout

A `504` response means the request exceeded the load balancer's timeout threshold.

| Cloud | Default timeout | Behavior |
| --- | --- | --- |
| AWS | 300 seconds (5 minutes) | Times out if the response is idle (no data sent) for longer than the threshold. |
| Google Cloud | 600 seconds (10 minutes) | Times out even if a response is in progress. |

:::note
AWS and Google Cloud load balancers handle timeouts differently. AWS only times out on idle connections, while Google Cloud enforces an end-to-end timeout regardless of activity. Consider this difference if you're migrating between cloud providers.
:::
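To make the difference concrete, the following sketch models the two behaviors with the default thresholds from the table. It's a simplified illustration rather than a description of either load balancer's implementation: the function name, the `aws` and `gcp` keys, and the two duration parameters are assumptions.

```python
from typing import Optional

# Default timeouts from the table above, in seconds.
DEFAULT_TIMEOUT_S = {"aws": 300, "gcp": 600}


def would_time_out(
    cloud: str,
    total_duration_s: float,
    longest_idle_gap_s: float,
    timeout_s: Optional[float] = None,
) -> bool:
    """Estimate whether a response would hit the load balancer timeout.

    AWS is modeled as an idle timeout (longest gap with no data sent).
    Google Cloud is modeled as an end-to-end limit on the whole response.
    """
    if cloud not in DEFAULT_TIMEOUT_S:
        raise ValueError(f"unknown cloud: {cloud!r}")
    limit = DEFAULT_TIMEOUT_S[cloud] if timeout_s is None else timeout_s
    if cloud == "aws":
        return longest_idle_gap_s > limit
    return total_duration_s > limit
```

Under this model, a 15-minute streaming response that sends a chunk every 10 seconds succeeds on AWS but times out on Google Cloud, while a 400-second response that sends nothing until the end times out on AWS but succeeds on Google Cloud.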

To resolve `504` errors, take any of the following actions:

1.  If your application legitimately needs more time to respond, contact [Anyscale support](mailto:support@anyscale.com) to request a timeout increase.
2.  If the timeout is unexpected, check your application for performance bottlenecks. See [Monitor a service](/services/monitoring.md).
3.  Configure client-side timeouts and retries to handle transient timeout errors gracefully. See [Load balancer timeouts](/services/retries-timeouts.md#load-balancer).
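As an illustration of the client-side approach, the following sketch retries `502`, `503`, and `504` responses with exponential backoff using only the Python standard library. The function names, attempt count, and delays are assumptions chosen for the example, and authentication headers are omitted. Retrying is generally only safe for idempotent requests.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

RETRYABLE_STATUSES = {502, 503, 504}


def fetch_status(url: str, timeout_s: float) -> int:
    """Send a GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx and 5xx responses; the code is on the exception.
        return err.code


def call_with_retries(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_s * 2 ** (attempt - 1))
        status = send()
    return status
```

A call might look like `call_with_retries(lambda: fetch_status(service_url, timeout_s=30))`, where `service_url` is a placeholder for your service's endpoint.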

## Related resources

-   [What are Anyscale services?](/services.md)
-   [Manage timeouts and retries for Anyscale services](/services/retries-timeouts.md)
-   [Monitor a service](/services/monitoring.md)
-   [Introduction to Anyscale clouds](/clouds.md)
-   [Ray Serve production guide](https://docs.ray.io/en/latest/serve/production-guide/best-practices.html)
