
Batch LLM inference with RayLLM-Batch

note

APIs are in preview and subject to change.

Large Language Models (LLMs) have become increasingly essential in a wide array of apps due to their advanced language understanding and generation capabilities. In production environments, there are two primary approaches to deploying LLMs: online serving and offline inference. While online serving provides low-latency, real-time responses, offline inference offers higher throughput and greater cost-effectiveness by optimizing GPU resource utilization. Consequently, for tasks that don't require immediate interaction, offline inference enables a more efficient and economical use of LLMs.

RayLLM-Batch is a library for high-throughput LLM batch inference. Deploying open-weight LLMs on Anyscale for batch processing offers several advantages over using closed-weight models:

  1. Cost-effectiveness: You have the flexibility to select any open-weight LLM that suits your use case. By batching inference with smaller LLMs on commodity GPUs, you can significantly reduce costs.
  2. Reliability: Hosting LLMs on the same cluster as your data processing pipeline eliminates the network delays and instability associated with querying public endpoints, enhancing reliability and reducing operational costs.
  3. Improved alignment: If open-weight LLMs don't meet your quality requirements, you can easily fine-tune the models using Anyscale LLMForge to better suit your specific use case.
  4. No vendor lock-in: You can customize RayLLM-Batch data processing pipelines so they integrate seamlessly into your existing data workflows without additional burden.

What is RayLLM-Batch?

RayLLM-Batch is a library that implements a Ray Data pipeline for open-weight LLM batch inference and provides out-of-the-box autoscaling, observability, and fault tolerance. RayLLM-Batch supports a number of important features, including:

  • Custom workload: RayLLM-Batch allows you to define a custom workload that loads and parses your own dataset in any format, including images.
  • Bring any custom model: RayLLM-Batch supports all popular open-weight LLMs as well as their fine-tuned versions and LoRA adapters.
  • Checkpointing (fault tolerance): RayLLM-Batch uploads partial results to the cloud during inference and can skip finished tasks when resuming the pipeline after a failure.
  • Automatic LLM engine optimization (in private preview): RayLLM-Batch offers an efficient auto-tuning mechanism that optimizes LLM engine throughput specifically for your workloads.

Quickstart

You can start from this workspace template, which has all the dependencies installed. Follow the template instructions to get your batch inference pipeline up and running.

The following example uses Llama-3.1-8B-Instruct in FP8 to batch process the CNN/DailyMail summarization workload on one L4 GPU. The example samples 0.1% of the dataset. Note that the example doesn't upload the results.

from rayllm_batch import RayLLMBatch
from rayllm_batch.workload import CNNDailySummary

# Initialize a workload.
workload = CNNDailySummary(dataset_fraction=0.001)
# Initialize a batch inference pipeline.
pipeline = RayLLMBatch(
    "examples/configs/vllm-llama-3.1-8b-fp8-l4.yaml",
    workload,
    num_replicas=1,
    batch_size=None,
)
# Run the batch inference pipeline, which includes the following tasks:
# 1. Load and parse data from the dataset.
# 2. Tokenize the dataset.
# 3. Run batch inference with an LLM.
# 4. Detokenize the results.
# Note that by specifying `output_path=...`, you can write results to S3 or local disk.
ds = pipeline.run()
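
To persist the results, pass an output location when running the pipeline. The following is a minimal sketch that assumes `output_path` is a keyword argument of `run()`, as the comment above suggests; the S3 URI is a placeholder for your own bucket or local directory.

# Write results to S3 (or a local path) instead of only returning them in memory.
# The destination URI below is a placeholder; replace it with your own.
ds = pipeline.run(output_path="s3://your-bucket/rayllm-batch-results/")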

Workload configuration

You can define a custom workload that specifies the data loading and parsing logic. The basic interface to define a workload is as follows:

@dataclass
class MyChatWorkload(ChatWorkloadBase):
    dataset_file: Optional[str] = "/path/to/dataset.jsonl"
    dataset_fraction: float = 1.0
    # Sampling parameters such as max_tokens, temperature, etc.
    sampling_params: Dict[str, Any] = field(
        default_factory=lambda: {"max_tokens": 200, "ignore_eos": False}
    )

    def load_dataset(self) -> Dataset:
        """Load dataset using Ray Data APIs."""

    def parse_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Parse each row in the dataset to make them compatible with
        OpenAI chat API messages. Specifically, the output row should only
        include a single key "messages" with type
        List[Dict[str, Union[str, List[Dict]]]].
        """

Below is a simple example of defining a workload that asks LLMs to answer a simple math question.

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import ray
from ray.data.dataset import Dataset
from rayllm_batch import ChatWorkloadBase

@dataclass
class MyChat(ChatWorkloadBase):
    dataset_file: Optional[str] = None
    dataset_fraction: float = 1.0
    sampling_params: Dict[str, Any] = field(
        default_factory=lambda: {"max_tokens": 20, "ignore_eos": False}
    )

    def load_dataset(self) -> Dataset:
        import random

        def _synthetic(batch):
            # Generate a synthetic chat prompt for each row in the batch.
            return {
                "messages": [
                    [
                        {
                            "role": "system",
                            "content": "You are a calculator. Your task is "
                            "to return the answer of the math question.",
                        },
                        {
                            "role": "user",
                            "content": f"{random.randint(100, 5000)}+{random.randint(100, 5000)}=?",
                        },
                    ]
                    for _ in range(len(batch["id"]))
                ]
            }

        return ray.data.range(100).map_batches(_synthetic)
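
You can then run this workload through the same pipeline API shown in the quickstart. The engine config path below reuses the sample Llama-3.1-8B config from earlier; swap in your own config as needed.

from rayllm_batch import RayLLMBatch

# Run the custom workload on a single LLM engine replica.
workload = MyChat()
pipeline = RayLLMBatch(
    "examples/configs/vllm-llama-3.1-8b-fp8-l4.yaml",
    workload,
    num_replicas=1,
    batch_size=None,
)
ds = pipeline.run()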

Batch engine configurations

You can also choose the model to use and manually set the LLM engine configuration. Here's an example configuration for Llama-3.1-8B in FP8 on one L4 GPU.

Example config file for Llama-3.1-8B
# File name: vllm-llama-3.1-8b-fp8-l4.yaml

model: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
llm_engine: vllm
accelerator_type: L4
engine_kwargs:
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
  max_num_seqs: 224
  use_v2_block_manager: True
  enable_prefix_caching: False
  preemption_mode: "recompute"
  block_size: 16
  kv_cache_dtype: "auto"
  enforce_eager: False
  gpu_memory_utilization: 0.95
  enable_chunked_prefill: True
  max_num_batched_tokens: 2048
  max_seq_len_to_capture: 32768
runtime_env:
  env_vars:
    VLLM_ATTENTION_BACKEND: "FLASH_ATTN"
    ENABLE_ANYSCALE_PREFIX_OPTIMIZATIONS: "0"

Since tuning LLM engine configurations can be time-consuming and require domain knowledge, Anyscale is testing an auto-tuner to rapidly locate near-optimal configurations. Contact Anyscale support if you're interested in getting access.

Data pipeline configuration

Finally, you have the flexibility to customize the data processing pipeline by specifying the number of LLM engines and the batch size. The following example runs the data pipeline on a cluster with five LLM engines and sets the batch size to 256.

  • LLM engine: The data pipeline automatically launches LLM engines on the same or different nodes within the current Ray cluster. The LLM engine configuration defines the number of GPUs that each LLM engine uses—that is, the product of tensor_parallel_size and pipeline_parallel_size.
  • Batch size: A larger batch size typically results in higher throughput due to better GPU utilization. However, because the batch size determines the granularity of Ray Data checkpoints, a larger batch means you may lose more data when recovering from a failure.

pipeline = RayLLMBatch(
    "examples/configs/vllm-llama-3.1-8b-fp8-l4.yaml",
    workload,
    num_replicas=5,
    batch_size=256,
)