Troubleshoot Anyscale service failures and HTTP errors

This article covers common failure states and HTTP errors you might encounter with Anyscale services, including how to identify them and what actions to take.

Service failure states

Anyscale services can enter several failure states. The service state appears in the Anyscale console and in CLI output from anyscale service status.

SYSTEM_FAILURE

A SYSTEM_FAILURE state indicates an unexpected error in the Anyscale control plane. This state typically isn't caused by your configuration or code.

Common symptoms include the following:

  • Multiple services enter SYSTEM_FAILURE at the same time.
  • The service event log shows "An unexpected system failure has occurred and manual intervention may be required."
  • The service event log shows unhealthy messages such as "deploying the most recent Ray Serve config failed" or "the Ray Serve REST API is unreachable or returned an unexpected error".
  • The head node ran out of disk space, preventing log and metric export.

A full head node disk can push a service into SYSTEM_FAILURE because the cluster can no longer export logs and metrics. If cluster metrics show the head node disk at capacity, redeploy the service with a larger instance type or reduce disk usage in your application.

For other SYSTEM_FAILURE cases, contact Anyscale support with the service ID. These failures typically require investigation by the Anyscale team.

USER_ERROR_FAILURE

A USER_ERROR_FAILURE state indicates a permissions or configuration error that prevents Anyscale from managing resources in your cloud account.

To troubleshoot a USER_ERROR_FAILURE, do the following:

  1. Check the service event log in the Anyscale console for specific error messages related to permissions.
  2. Run anyscale cloud verify to check your cloud configuration. See anyscale cloud verify.
  3. Fix any permission or configuration issues identified by the verification.
  4. Redeploy the service.

If the issue persists after verifying your cloud configuration, contact Anyscale support.

Stuck in STARTING

Services remain in the STARTING state until the primary version cluster starts and cloud networking resources are provisioned. These include TLS certificates, load balancers, and target groups.

The first deployment in a new cloud can take approximately 10 minutes while cloud resources provision. If a service remains in STARTING longer than expected, a permissions or configuration error might be blocking provisioning.

To troubleshoot, do the following:

  1. Check the service event log in the Anyscale console for error messages.
  2. Run anyscale cloud verify to validate your cloud configuration. See anyscale cloud verify.
  3. If no errors appear, contact Anyscale support with the service ID.

Stuck in TERMINATING

A service can become stuck in the TERMINATING state, where it never finishes terminating and can't be deleted. This can happen when underlying cluster resources are removed before the service properly terminates.

Symptoms include the following:

  • The service shows as "terminating" in the Anyscale console but never completes.
  • anyscale service terminate has no effect.
  • anyscale service delete fails with the error:

    Service must be in a TERMINATED state before it can be deleted.

  • anyscale cloud delete fails because the stuck service blocks cloud deletion.

Contact Anyscale support with the service ID. Resolving this state requires internal intervention.

HTTP 5xx errors

HTTP 502, 503, and 504 errors from an Anyscale service typically originate from the cloud load balancer or the serving infrastructure, not from your application code. For these codes, the specific status code helps narrow the cause. HTTP 500 responses usually indicate unhandled exceptions in your application code. For a 500, check the Ray Serve logs for stack traces.

For background on load balancer timeout defaults and configuring client-side retries, see Manage timeouts and retries for Anyscale services.

HTTP 502 Bad Gateway

A 502 response means the load balancer received an invalid response from the service.

Cause: The load balancer's idle timeout is longer than the keep-alive timeout in Ray Serve's HTTP server (uvicorn). When the server closes an idle connection that the load balancer still expects to be open, the load balancer returns a 502.

Solution: Set keep_alive_timeout_s in your Ray Serve HTTP options to a value longer than the load balancer's idle timeout:

| Cloud | Load balancer idle timeout | Recommended keep_alive_timeout_s |
| --- | --- | --- |
| AWS | 300 seconds (5 minutes) | Greater than 300 seconds |
| Google Cloud | 600 seconds (10 minutes) | Greater than 600 seconds |

Configure keep_alive_timeout_s in your service config:

http_options:
  keep_alive_timeout_s: 650
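
The rule above can be checked before you deploy. The following is a minimal sketch, not an Anyscale API: the function name, the "aws" and "gcp" keys, and the in-memory config dict are assumptions made for illustration. Only the timeout values come from the table above.

```python
# Load balancer idle timeouts in seconds, taken from the table above.
# The "aws" and "gcp" keys are illustrative labels, not Anyscale identifiers.
LB_IDLE_TIMEOUT_S = {"aws": 300, "gcp": 600}


def keep_alive_is_safe(cloud: str, keep_alive_timeout_s: float) -> bool:
    """Return True if the server keep-alive outlasts the load balancer idle timeout."""
    return keep_alive_timeout_s > LB_IDLE_TIMEOUT_S[cloud]


# Hypothetical service config, already loaded into a dict (for example, from YAML).
config = {"http_options": {"keep_alive_timeout_s": 650}}
keep_alive = config["http_options"]["keep_alive_timeout_s"]

for cloud in LB_IDLE_TIMEOUT_S:
    print(cloud, keep_alive_is_safe(cloud, keep_alive))
```

A value of 650 passes for both clouds, which is why it works as a single setting if you deploy to either provider.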

HTTP 503 Service Unavailable

A 503 response means no healthy targets are available to handle the request.

Any of the following can cause a 503 response:

  • The load balancer target group is empty because no cluster instances are running.
  • None of the targets in the load balancer target group are healthy.
  • On Google Cloud, 503 errors can occur for up to five minutes after a service first starts while health checks initialize.

To troubleshoot, do the following:

  1. Check that the service has running instances in the Anyscale console.
  2. Check the service event log for version health status. If a version is marked UNHEALTHY, review the Ray Serve logs for application errors.
  3. If instances are running and healthy but 503 errors persist, contact Anyscale support.
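
The causes and checks above can be summarized as a small decision function. This is only a sketch: the function, its parameters, and the "gcp" label are hypothetical, and you would read the input values from the Anyscale console or your own monitoring. The five-minute window is the Google Cloud figure stated above.

```python
GCP_WARMUP_S = 5 * 60  # Google Cloud health checks can take up to five minutes.


def diagnose_503(running_instances: int, healthy_targets: int,
                 cloud: str, seconds_since_start: float) -> str:
    """Map observed service facts to the most likely 503 cause listed above."""
    if running_instances == 0:
        return "Target group is empty: no cluster instances are running."
    if cloud == "gcp" and seconds_since_start < GCP_WARMUP_S:
        return "Health checks may still be initializing: wait and retry."
    if healthy_targets == 0:
        return "No healthy targets: review the Ray Serve logs for application errors."
    return "Instances are running and healthy: contact Anyscale support."


print(diagnose_503(running_instances=2, healthy_targets=0,
                   cloud="aws", seconds_since_start=1200))
```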

HTTP 504 Gateway Timeout

A 504 response means the request exceeded the load balancer's timeout threshold.

| Cloud | Default timeout | Behavior |
| --- | --- | --- |
| AWS | 300 seconds (5 minutes) | Times out if the response is idle (no data sent) for longer than the threshold. |
| Google Cloud | 600 seconds (10 minutes) | Times out even if a response is in progress. |

Note: AWS and Google Cloud load balancers handle timeouts differently. AWS only times out on idle connections, while Google Cloud enforces an end-to-end timeout regardless of activity. Consider this difference if you're migrating between cloud providers.
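
To make the difference concrete, here is a simplified model of the two behaviors using the default thresholds listed above. The function and the "aws" and "gcp" labels are illustrative assumptions, and real load balancers have more settings than this sketch captures.

```python
def would_time_out(cloud: str, total_s: float, longest_idle_s: float) -> bool:
    """Model the default 504 behavior described above.

    total_s: end-to-end duration of the request in seconds.
    longest_idle_s: longest stretch in seconds with no response data sent.
    """
    if cloud == "aws":
        # AWS: only idle time counts against the 300-second threshold.
        return longest_idle_s > 300
    if cloud == "gcp":
        # Google Cloud: total duration counts, even while data is flowing.
        return total_s > 600
    raise ValueError(f"unknown cloud: {cloud}")


# A 15-minute streaming response that sends a chunk at least every 10 seconds.
print(would_time_out("aws", total_s=900, longest_idle_s=10))  # False
print(would_time_out("gcp", total_s=900, longest_idle_s=10))  # True
```

Under this model, a long streaming response that succeeds on AWS can still return a 504 on Google Cloud.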

To resolve 504 errors, take any of the following actions:

  1. If your application legitimately needs more time to respond, contact Anyscale support to request a timeout increase.
  2. If the timeout is unexpected, check your application for performance bottlenecks. See Monitor a service.
  3. Configure client-side timeouts and retries to handle transient timeout errors gracefully. See Load balancer timeouts.
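
As a rough illustration of client-side retries, the sketch below retries 502, 503, and 504 responses with exponential backoff. It is not an Anyscale client: `call_with_retries`, its parameters, and the `send` callable are hypothetical names. The `send` callable stands in for your own HTTP request function, which should set its own request timeout and return a status code and body.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {502, 503, 504}  # Status codes treated as transient in this sketch.


def call_with_retries(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        # Return on success, on a non-retryable error, or on the final attempt.
        if status not in RETRYABLE or attempt == max_attempts - 1:
            break
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_s * 2 ** attempt)
    return status, body


# Usage with a fake sender that fails twice before succeeding.
responses = iter([(503, b""), (504, b""), (200, b"ok")])
delays = []
result = call_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, b'ok') [1.0, 2.0]
```

Retry only requests that are safe to repeat, because a request that timed out at the load balancer might still have been processed by your application.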