
Serve LLMs with Anyscale services

This page provides an overview of how you can use Anyscale services to deploy large language models (LLMs) in production to generate text, answer questions, and power intelligent workflows at scale.

Anyscale services provide you with a production-grade solution through three integrated components:

  • Ray Serve for orchestration and scaling.
  • vLLM for inference.
  • Anyscale for infrastructure management.

For an overview of LLM serving, see What is LLM serving?.

Common challenges for serving LLMs

The following table outlines several common challenges for serving LLMs:

| Challenge | Description |
| --- | --- |
| Managing GPU memory | LLM weights and the KV cache consume tens to hundreds of gigabytes. You can use techniques such as paged attention, quantization, and efficient memory sharing to prevent out-of-memory errors. |
| Minimizing latency | Your users expect fast, interactive responses. You can reduce time-to-first-token and subsequent token generation time with optimizations such as continuous batching, speculative decoding, and custom CUDA kernels. |
| Ensuring scalability | Production traffic patterns are unpredictable and bursty. Your serving solution must autoscale replicas quickly and reliably to meet demand without downtime. |
| Optimizing cost | GPUs represent significant infrastructure costs. You need to maximize hardware utilization and scale to zero during idle periods to control expenses. |

Orchestration with Anyscale services

Anyscale services build on Ray Serve, a scalable model serving library that runs on the Ray distributed runtime. With Ray Serve as your orchestration layer, you benefit from the following (a deployment sketch follows this list):

  • Automatic scaling and load balancing: Automatically adds or removes model replicas based on real-time traffic patterns.
  • Unified multi-model deployment: Deploy, manage, and route traffic to multiple models using a single configuration.
  • OpenAI-compatible API: Use drop-in replacement endpoints for your existing OpenAI API clients.
  • Dynamic multi-LoRA support: Serve a single base model with different LoRA adapters attached at request time.
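As one illustration, the following sketch uses Ray Serve's LLM API to deploy a model behind an OpenAI-compatible, autoscaling endpoint. The module paths, field names, and model shown follow recent Ray releases and are assumptions to adapt to your Ray version and model.

```python
# Sketch of an OpenAI-compatible LLM deployment with Ray Serve's LLM API.
# Module paths, field names, and the model shown are assumptions based on
# recent Ray releases; adjust them for your Ray version and model.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name clients pass as `model`
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # Hugging Face model to load
    ),
    accelerator_type="A10G",                        # GPU type for each replica
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=4),  # scale with traffic
    ),
)

# Build a Ray Serve application that exposes /v1/chat/completions and related
# OpenAI-compatible routes, then run it.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```

Adding more entries to `llm_configs` serves multiple models behind the same endpoint, and attaching a LoRA configuration to the same object enables the multi-LoRA pattern described above.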

Inference with vLLM

Ray Serve integrates with vLLM as its inference engine. Key vLLM optimizations include the following (a configuration sketch follows this list):

  • PagedAttention: Manages the KV cache in fixed-size blocks, similar to virtual memory paging, which nearly eliminates memory fragmentation.
  • Continuous batching: Maximizes GPU utilization by continuously adding new requests to processing batches.
  • Optimized CUDA kernels: Integrates optimized kernels including FlashAttention and supports advanced quantization methods such as INT4 and FP8.
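These optimizations are largely transparent, but a few engine options control how much GPU memory the paged KV cache can use and how many requests the continuous batcher keeps in flight. The standalone vLLM sketch below shows those knobs; the model and values are illustrative assumptions, not tuned recommendations.

```python
# Standalone vLLM sketch showing knobs related to the optimizations above.
# The model and values are illustrative, not tuned recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any supported Hugging Face model
    gpu_memory_utilization=0.90,         # fraction of GPU memory for weights + paged KV cache
    max_num_seqs=64,                     # cap on sequences the continuous batcher keeps in flight
    # quantization="fp8",                # optional, if the model and hardware support it
)

outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

When you serve through Ray Serve instead, you typically pass the same engine options through the LLM configuration shown earlier rather than constructing the engine directly.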

Managed infrastructure and enterprise-ready features

Anyscale provides the infrastructure and tools you need to run Ray Serve securely and cost-effectively at scale, including the following (a sample client call follows this list):

  • Managed infrastructure: Create optimized Ray clusters in your cloud account without manual provisioning.
  • Cost optimization: Use pay-as-you-go, autoscaling node pools that scale to zero when idle.
  • Enterprise security: Secure your deployments with private networking (VPC), single sign-on (SSO), and detailed audit logs.
  • Seamless scaling: Handle traffic spikes through Ray Autoscaler and pre-warmed instance pools.
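After you deploy a Ray Serve LLM app as an Anyscale service, the service exposes an authenticated HTTPS endpoint. Because the endpoint is OpenAI-compatible, any OpenAI client can call it; in the sketch below, the base URL, bearer token, and model ID are placeholders to replace with your service's actual values.

```python
# Query a deployed service with the OpenAI Python client. The base URL,
# token, and model ID are placeholders for your service's actual values.
from openai import OpenAI

client = OpenAI(
    base_url="https://my-llm-service.example.com/v1",  # placeholder service URL
    api_key="MY_SERVICE_BEARER_TOKEN",                 # placeholder bearer token
)

response = client.chat.completions.create(
    model="qwen-0.5b",  # the model_id configured in the deployment
    messages=[{"role": "user", "content": "Give me three uses for an LLM endpoint."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```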