LLMs and agentic AI on Anyscale

The Anyscale platform provides a comprehensive, end-to-end ecosystem for developing and deploying large language model (LLM) applications in production. Powered by the Ray distributed computing framework, Anyscale enables organizations to efficiently serve, fine-tune, and run batch inference on open source LLMs at scale, as well as build RAG and agentic applications with the Model Context Protocol (MCP). This document provides a high-level overview of these core capabilities.

LLM serving

Anyscale offers a robust and scalable solution for deploying LLMs to handle real-time user requests with low latency and high throughput. Anyscale combines Ray Serve for orchestration with vLLM for high-performance inference.

Anyscale provides a managed, cost-effective, and performance-optimized serving infrastructure. The following table outlines the features that allow Anyscale to solve the primary challenges of LLM serving, which include managing GPU memory, minimizing latency, and ensuring scalability:

Feature | Description
Flexible orchestration with Ray Serve | Ray Serve acts as the scalable orchestration layer, simplifying production deployment.
Automatic scaling | Automatically scales model replicas up and down (including to zero) based on traffic, optimizing GPU utilization and minimizing costs.
High-performance backends | Integrates with state-of-the-art inference engines like vLLM, which use techniques like PagedAttention and continuous batching to maximize throughput and manage GPU memory effectively.
Unified multi-model deployment | Serves multiple models or model variants from a single deployment, simplifying management.
Dynamic multi-LoRA support | Allows a single base model to serve requests for many different fine-tuned adapters simultaneously.
OpenAI-compatible API | Exposes a familiar API endpoint, allowing for seamless integration with existing applications built for OpenAI models.
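
The following is a minimal sketch of what such a deployment can look like in code. It assumes a recent Ray version that ships the ray.serve.llm module (LLMConfig and build_openai_app); the served model name, replica counts, and engine settings are illustrative placeholders, not recommended values:

```python
# Minimal sketch: serving an open source LLM with Ray Serve's LLM APIs.
# Assumes a recent Ray release that includes ray.serve.llm and a GPU node
# large enough for the chosen model.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "qwen-0.5b",                       # name exposed on the API (placeholder)
        "model_source": "Qwen/Qwen2.5-0.5B-Instruct",  # Hugging Face model to load (placeholder)
    },
    deployment_config={
        # Replica count follows traffic; setting min_replicas to 0 enables scale-to-zero.
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
    },
    # Extra keyword arguments passed through to the vLLM engine.
    engine_kwargs={"max_model_len": 8192},
)

# Build an OpenAI-compatible app (chat and completions endpoints) and deploy it.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```

Once the service is running, any OpenAI-compatible client SDK can target the service endpoint without code changes beyond the base URL and model name.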

To learn more, see the following:

LLM post-training and fine-tuning

Anyscale provides a powerful and flexible environment for adapting pre-trained foundation models to specific domains, tasks, or behavioral requirements.

Anyscale supports parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA. You can distribute training for full fine-tuning with Fully Sharded Data Parallel (FSDP) or DeepSpeed. This flexibility lets teams optimize for cost, speed, or maximum model control.

The following table provides an overview of methodologies for fine-tuning supported on Anyscale:

Methodology | Description
Supervised fine-tuning (SFT) | The basics of supervised fine-tuning include the following:
  • Training a model to follow instructions using labeled prompt-response pairs.
  • Adding domain-specific knowledge.
Preference tuning and alignment | To align model behavior with human preferences, Anyscale supports traditional reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO). You can also use the following stable direct preference optimization algorithms:
  • DPO (direct preference optimization)
  • ORPO (odds-ratio preference optimization)
  • KTO (Kahneman-Tversky optimization)
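
As a sketch of what parameter-efficient SFT can look like on the platform, the following example pairs Ray Train's TorchTrainer with Hugging Face Transformers and PEFT. The base model, dataset, and hyperparameters are illustrative placeholders, and this is not Anyscale's managed fine-tuning API:

```python
# Illustrative sketch: distributed LoRA fine-tuning with Ray Train + Hugging Face.
# Checkpoint reporting and evaluation are omitted for brevity.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(config["model"])
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(config["model"])
    # Wrap the base model with LoRA adapters so only a small fraction of weights train.
    model = get_peft_model(
        model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
    )

    # Small public instruction dataset used purely for illustration.
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/tmp/sft",
            per_device_train_batch_size=2,
            num_train_epochs=1,
            report_to="none",
        ),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


# Ray Train runs one copy of the loop per worker and sets up distributed training.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"model": "Qwen/Qwen2.5-0.5B-Instruct"},  # placeholder model
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```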

To learn more, see the following:

LLM batch inference

For offline tasks like data enrichment, analysis, or model evaluation on large datasets, Anyscale leverages ray.data.llm to provide scalable and efficient batch inference.

The following table provides an overview of features that help scale batch inference on Anyscale:

Feature | Description
Scalable data processing with Ray Data | Ray Data is a distributed data processing library that seamlessly scales LLM inference across a cluster of machines. It handles the parallelism and resource management required to process terabytes of data.
Optimized inference integration | The ray.data.llm module integrates directly with high-performance engines like vLLM, bringing the same optimizations from real-time serving (for example, continuous batching) to the batch context for maximum efficiency.
Simplified API | The build_llm_processor API helps you easily configure and run inference jobs. You can target models running locally on the cluster using vLLM or query external models using an OpenAI-compatible API endpoint.
Support for multimodal models | Ray Data LLM handles both text-only and vision-language models (VLMs), enabling batch processing of datasets containing text and images.
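
The following sketch shows the general shape of a ray.data.llm batch inference job. It assumes a recent Ray release; exact configuration field names can vary between versions, and the model, prompts, and sampling parameters are placeholders:

```python
# Sketch: offline batch inference with Ray Data LLM and a vLLM-backed processor.
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Processor configuration; field names may differ slightly across Ray versions.
config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,   # number of vLLM engine replicas
    batch_size=64,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat-style request.
    preprocess=lambda row: {
        "messages": [
            {"role": "system", "content": "Summarize the text in one sentence."},
            {"role": "user", "content": row["text"]},
        ],
        "sampling_params": {"temperature": 0.0, "max_tokens": 128},
    },
    # Keep only the fields you care about from each output row.
    postprocess=lambda row: {"text": row["text"], "summary": row["generated_text"]},
)

# Toy input dataset; a real job would read Parquet, JSON, or other files from storage.
ds = ray.data.from_items([{"text": "Ray Data scales offline LLM inference."}])
results = processor(ds)
results.show()
```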

To learn more, see the following:

Retrieval-augmented generation (RAG)

Anyscale provides a unified platform to build, deploy, and scale end-to-end RAG pipelines, which ground LLMs in external knowledge sources to provide up-to-date, accurate, and attributable responses. The following table provides an overview of capabilities:

Capability | Description
Scalable data ingestion and indexing | RAG pipelines begin with an offline data processing stage to create a vector index. Anyscale processes document corpuses with Ray Data, which handles chunking, embedding generation, and writing to vector databases in parallel.
High-performance online serving | You use Anyscale services to deploy a low-latency online LLM component that retrieves context and generates an answer. This allows both the retriever and the generator LLM to scale independently based on request volume.
End-to-end unified platform | Anyscale uses the same framework to build and manage both the batch-oriented indexing pipeline and the latency-sensitive serving application. This simplifies the MLOps lifecycle, reduces system complexity, and accelerates development.
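
As an illustration of the offline indexing stage, the following Ray Data sketch chunks documents and generates embeddings in parallel. The input path, chunking rule, embedding model, and output location are placeholders, and a production pipeline would typically upsert into a vector database rather than write Parquet files:

```python
# Sketch: the offline indexing stage of a RAG pipeline with Ray Data.
import ray


def chunk(row):
    # Naive fixed-size chunking; production pipelines usually split on document structure.
    text = row["text"]
    return [
        {"doc_id": row["doc_id"], "chunk": text[i:i + 1000]}
        for i in range(0, len(text), 1000)
    ]


class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        # Embedding model is a placeholder choice.
        self.model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["chunk"]))
        return batch


# Placeholder corpus with `doc_id` and `text` columns.
ds = ray.data.read_parquet("s3://my-bucket/corpus/")

chunks = ds.flat_map(chunk)
embedded = chunks.map_batches(Embedder, concurrency=4, num_gpus=1, batch_size=128)

# Persist chunk/embedding pairs; a real pipeline would write to a vector database here.
embedded.write_parquet("s3://my-bucket/vector-index/")
```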

To learn more, see the following:

Agents and MCP (Model Context Protocol)

Anyscale is an ideal platform for building and operating sophisticated LLM agents, which use an LLM as a reasoning engine to plan and execute complex tasks by interacting with tools and external environments. The following table provides an overview of capabilities:

Capability | Description
Complex agent orchestration | Agentic workflows are often dynamic and involve complex operations (for example, LLM calls, tool execution, state updates). Ray's distributed task and actor framework provides a natural and powerful way to orchestrate these workflows with arbitrary dependencies.
Scalable MCP deployments | You can package the tools an agent relies on (for example, code interpreters, database query engines, APIs) as independent microservices and deploy them at scale using Ray Serve. By wrapping an MCP server in a Ray Serve deployment, each tool gains production-grade capabilities such as autoscaling to handle variable loads, load balancing across replicas, and fault tolerance for high availability.
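
As a simplified stand-in for wrapping an MCP server, the following Ray Serve sketch exposes a single agent tool as an autoscaling HTTP deployment. A real MCP integration would speak the MCP protocol rather than this ad-hoc JSON contract, and the tool logic and route are purely illustrative:

```python
# Sketch: exposing an agent tool as an autoscaling Ray Serve deployment.
from ray import serve
from starlette.requests import Request

# Toy in-memory knowledge base standing in for a real backend (database, API, etc.).
DOCS = {
    "ray": "Ray is a distributed computing framework.",
    "vllm": "vLLM is a high-throughput LLM inference engine.",
}


@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 8})
class LookupTool:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        key = payload.get("query", "").lower()
        return {"result": DOCS.get(key, "no match")}


# Each tool runs as its own Serve application and scales independently of the agent.
serve.run(LookupTool.bind(), name="lookup_tool", route_prefix="/tools/lookup")
```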

To learn more, see the following:

Other generative AI capabilities

Beyond text-based LLMs, the Anyscale platform is a versatile solution for a wide range of generative AI workloads, leveraging its scalable compute infrastructure for different data modalities and model architectures. The following table provides an overview of capabilities:

Capability | Description
Multimodal pipelines | Build complex, multi-stage pipelines that combine different models and data types. For example, use the Whisper model for scalable audio transcription and then pipe the results to an LLM for summarization, analysis, or content moderation (LLM-as-a-judge).
Image generation | Anyscale provides the raw GPU power needed for demanding image generation workloads. The platform supports both fine-tuning diffusion models such as Stable Diffusion on custom datasets and the massive computational task of pre-training these models from scratch.
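
To make the multimodal pipeline pattern concrete, the following sketch runs the transcription stage with Ray Data and a Whisper checkpoint from Hugging Face. The file paths and model size are placeholders, and the downstream LLM stage can reuse the batch inference pattern shown earlier:

```python
# Sketch: the transcription stage of an audio-to-LLM pipeline with Ray Data.
import ray


class Transcriber:
    def __init__(self):
        from transformers import pipeline
        # Whisper checkpoint is a placeholder; pick a size that fits your GPUs.
        self.asr = pipeline(
            "automatic-speech-recognition", model="openai/whisper-small", device=0
        )

    def __call__(self, batch):
        outputs = self.asr(list(batch["path"]))
        batch["transcript"] = [out["text"] for out in outputs]
        return batch


# Placeholder listing of audio files; in practice this comes from object storage.
audio = ray.data.from_items([{"path": "/data/audio/episode1.wav"}])

# Stage 1: scale transcription across GPU workers.
transcripts = audio.map_batches(Transcriber, concurrency=2, num_gpus=1, batch_size=8)

# Stage 2 (not shown): feed the `transcript` column to an LLM for summarization or
# LLM-as-a-judge moderation, for example with the ray.data.llm processor above.
transcripts.write_parquet("/data/transcripts/")
```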

To learn more, see the following: