Develop a Ray Serve application

This page introduces the core concepts for developing Ray Serve applications on Anyscale. You learn about the client-server model, key Ray Serve abstractions, and how client requirements shape your application design.

Online versus offline processing

Ray Serve provides the framework for online processing—serving real-time responses to client requests. In an online application, a client sends a request and waits for an immediate response. Examples include web APIs, chatbots, and real-time prediction services.

Ray Serve also supports near real-time asynchronous task processing, a form of online processing where your HTTP APIs stay responsive while the system performs work in the background. See Ray docs on asynchronous inference.

In contrast, offline processing involves batch workloads where immediate responses aren't required. For offline batch inference or data processing, use Ray Data and Anyscale jobs instead of Ray Serve. See Get started with jobs and the Ray Data documentation.

Ray Serve provides the framework and infrastructure for building scalable online applications that respond to client requests with low latency and high throughput.

The client-server model

When you build with Ray Serve, you're building the server side of a client-server architecture. Ray Serve provides the server-side framework—you write Python code that defines how to handle requests, and Ray Serve manages routing those requests to your code, scaling up or down based on load, and recovering from failures.

Your Ray Serve application exists to serve clients—the systems or users that send requests and expect responses. Understanding your client requirements is essential because those requirements drive nearly every design decision in your application.

Common client types include the following:

  • Web and mobile applications: End users making API calls through a browser or mobile app expect low latency, typically under 200ms, and require response formats such as JSON or streamed text.
  • Microservices: Other services in a larger system architecture might batch requests, require specific retry behavior, or need responses in specific data formats.
  • LLM agents and agentic systems: Agent frameworks often require streaming responses, tool calling interfaces, and support for conversational context.
  • Batch processing systems: Automated systems querying for predictions might prioritize throughput over latency and send large batches of requests.

Tip: If you're building a drop-in replacement for an existing API or service, your downstream constraints are already defined. Your Ray Serve application must match the expected request and response formats, latency requirements, and authentication patterns of the system you're replacing.

Ray Serve handles the server-side complexity:

  • Request routing: Distributing incoming requests across multiple running instances of your code, called replicas.
  • Autoscaling: Automatically adding or removing replicas based on request load.
  • Load balancing: Ensuring even distribution of work across available replicas.
  • Fault tolerance: Detecting failed replicas and routing requests to healthy instances.

You focus on implementing the business logic—loading models, processing requests, and generating responses.

Key Ray Serve concepts

Ray Serve uses three core abstractions to define scalable applications: deployments, replicas, and applications.

Deployment

A deployment is the fundamental building block of Ray Serve. A deployment wraps a Python class that handles requests. You mark a class as a deployment using the @serve.deployment decorator:

from ray import serve

@serve.deployment
class TextClassifier:
    def __init__(self, model_path: str):
        # Load the model in the constructor.
        self.model = load_model(model_path)

    def __call__(self, text: str):
        # Process requests in the __call__ method.
        return self.model.predict(text)

The __init__ method runs once when each replica of the deployment starts, typically to load models or initialize resources. The __call__ method handles each incoming request.

See Deployment in the Ray documentation for more details.

Replica

A replica is a running instance of a deployment. Ray Serve can run multiple replicas of the same deployment to handle more requests in parallel. Each replica is a separate Python process with its own copy of the model and resources.

You configure the number of replicas based on your load requirements. When you enable autoscaling, Ray Serve automatically scales replicas up or down based on traffic. See Autoscaling in the Ray documentation.
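
A minimal sketch of both options, using a stand-in classifier with a placeholder prediction; the class names, replica counts, and autoscaling targets are illustrative, not recommendations:

from ray import serve

# Fixed scaling: Ray Serve keeps exactly two replicas of this deployment running.
@serve.deployment(num_replicas=2)
class FixedScaleClassifier:
    def __call__(self, text: str):
        return {"label": "positive"}  # placeholder prediction

# Autoscaling: Ray Serve adds or removes replicas based on ongoing requests per replica.
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,
    }
)
class AutoscalingClassifier:
    def __call__(self, text: str):
        return {"label": "positive"}  # placeholder prediction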

Application

An application groups one or more deployments and defines the ingress, or entry point, that handles incoming traffic. An application can be a single deployment serving one model, or multiple deployments composed together to build complex inference pipelines.

# Single deployment application
classifier = TextClassifier.bind(model_path="./model")
serve.run(classifier)

# Multi-deployment application (model composition)
preprocessor = Preprocessor.bind()
model = Model.bind()
postprocessor = Postprocessor.bind(preprocessor, model)
serve.run(postprocessor)
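
Once serve.run returns, you can call the application through the handle it returns as well as over HTTP. A minimal sketch, continuing from the single-deployment example above:

# serve.run returns a handle to the application's ingress deployment.
handle = serve.run(classifier)

# handle.remote() returns a response object; result() blocks until the result is ready.
response = handle.remote("Ray Serve makes text classification easy")
print(response.result())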

See Application in the Ray documentation for more details.

Model composition

For complex applications, you can compose multiple deployments into a pipeline using DeploymentHandle objects. You can configure and scale each deployment in the pipeline independently.

from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    def preprocess(self, text: str):
        return text.lower().strip()

@serve.deployment
class Model:
    def __init__(self):
        self.model = load_model()

    def predict(self, text: str):
        return self.model(text)

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, text: str):
        # Call deployments in sequence.
        processed = await self.preprocessor.preprocess.remote(text)
        result = await self.model.predict.remote(processed)
        return result

# Bind and compose deployments.
preprocessor = Preprocessor.bind()
model = Model.bind()
pipeline = Pipeline.bind(preprocessor, model)

See Deploy model composition in the Ray documentation for more details. If you're passing large objects between deployments and using high-throughput serving, see Passing large objects between deployments.
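
Because each stage is its own deployment, you can override its configuration when you bind it. A minimal sketch, reusing the Preprocessor, Model, and Pipeline classes above; the replica counts and GPU request are illustrative:

# Override options per deployment to scale each stage independently.
preprocessor = Preprocessor.options(num_replicas=2).bind()
model = Model.options(num_replicas=4, ray_actor_options={"num_gpus": 1}).bind()
pipeline = Pipeline.bind(preprocessor, model)

serve.run(pipeline)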

Design considerations

Client requirements directly influence how you design your Ray Serve application. The following list maps common client requirements to Ray Serve design considerations:

Low latency, for example under 200 ms for real-time APIs:
  • Use single-model deployments to minimize overhead.
  • Allocate sufficient CPU/GPU resources per replica.
  • Configure max_ongoing_requests to avoid overwhelming CPU and memory.
  • Consider fractional GPU allocation for smaller models.

High throughput, for example thousands of requests per second:
  • Enable high-throughput serving optimizations. See High-throughput serving.
  • Enable autoscaling with appropriate target_ongoing_requests.
  • Configure dynamic request batching to process multiple requests together, as shown in the sketch after this list.
  • Use multiple replicas distributed across nodes.

Streaming responses, for example LLM chat applications:
  • Use async def methods and yield for incremental responses.
  • Configure appropriate timeouts for long-running streams.
  • See Streaming responses in the Ray documentation.

Variable traffic patterns, for example unpredictable spikes:
  • Enable autoscaling with a suitable min/max replica range.
  • Set downscale_delay_s to avoid rapid scaling oscillations.
  • Monitor queue depth and adjust target_ongoing_requests.

Complex inference pipelines, for example multi-model workflows:
  • Use model composition with DeploymentHandle objects.
  • Scale each stage independently based on computational cost.
  • Consider caching intermediate results if applicable.
  • If using high-throughput serving with large objects, see Passing large objects between deployments.

Many models, for example per-tenant or per-user models:
  • Use model multiplexing on downstream deployments to share replicas across models. See the pattern described in the next section.
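
Dynamic request batching uses the serve.batch decorator to group concurrent requests into a single model call. A minimal sketch, assuming the same placeholder load_model helper used elsewhere on this page; the batch size and timeout are illustrative:

from ray import serve

@serve.deployment
class BatchedClassifier:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)  # placeholder loader, as in the examples above

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.05)
    async def classify_batch(self, texts: list):
        # Ray Serve collects concurrent requests into a single list.
        # Return one result per input, in the same order.
        return [self.model.predict(text) for text in texts]

    async def __call__(self, text: str):
        # Callers still send one item at a time; batching happens transparently.
        return await self.classify_batch(text)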

Pattern: Model multiplexing with high-throughput serving

If you need to serve many models with similar shapes but different weights, Anyscale recommends using model multiplexing on downstream deployments. This pattern is compatible with high-throughput serving and allows efficient resource utilization. See Design considerations for other common patterns.

The key principle is that the ingress deployment should handle simple request routing, while downstream deployments handle model loading and multiplexing logic.

Structure your application so the ingress deployment routes requests to a downstream deployment that manages model multiplexing:

from ray import serve
from ray.serve.handle import DeploymentHandle
from fastapi import FastAPI, Request

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Ingress:
    def __init__(self, router: DeploymentHandle):
        self.router = router

    @app.post("/{model_id}/predict")
    async def predict(self, model_id: str, request: Request):
        # Simple ingress: route to the downstream deployment.
        data = await request.json()
        return await self.router.route.remote(model_id, data)

@serve.deployment
class ModelRouter:
    def __init__(self):
        self.models = {}

    def load_model(self, model_id: str):
        # Load the model on demand and cache it.
        if model_id not in self.models:
            self.models[model_id] = load_model(model_id)
        return self.models[model_id]

    async def route(self, model_id: str, data: dict):
        # Model multiplexing happens in the downstream deployment.
        model = self.load_model(model_id)
        return model.predict(data)

# Compose deployments.
router = ModelRouter.bind()
ingress = Ingress.bind(router)

This pattern separates concerns:

  • Ingress deployment: Handles HTTP requests and routes them to the appropriate downstream deployment.
  • Downstream deployment: Manages model loading, caching, and inference.

You can scale the ingress and router deployments independently based on their resource requirements. See Model multiplexing in the Ray documentation for additional configuration options.
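
The Ray Serve multiplexing API can replace the manual model cache shown in ModelRouter. A minimal sketch, assuming the same placeholder load_model helper; the per-replica model limit is illustrative:

from ray import serve

@serve.deployment
class MultiplexedRouter:
    @serve.multiplexed(max_num_models_per_replica=8)
    async def get_model(self, model_id: str):
        # Ray Serve caches up to max_num_models_per_replica models per replica and
        # evicts the least recently used model when the limit is reached.
        return load_model(model_id)  # placeholder loader, as above

    async def __call__(self, request):
        # Clients send the model ID in the serve_multiplexed_model_id request header;
        # Ray Serve prefers replicas that already have that model loaded.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.predict(await request.json())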

Development workflow on Anyscale

Anyscale provides a streamlined workflow for developing and deploying Ray Serve applications:

  1. Develop in workspaces: Use Anyscale workspaces for interactive development. Workspaces provide Jupyter notebooks, VS Code integration, and direct access to Ray clusters for testing. See Workspaces.

  2. Deploy to services: Once your application is ready, deploy it to production using Anyscale services. Services provide production features such as load balancing, autoscaling, zero-downtime updates, and monitoring. See What are Anyscale services?.

  3. Iterate with in-place updates: During development, use in-place service updates to quickly test changes without redeploying the entire service. See Update a service in-place.

This workflow separates the development environment (workspaces) from the production environment (services), so you can iterate quickly while maintaining production stability.

Next steps

Now that you understand the core concepts, explore the following resources: